Abstract: We consider the problem of predicting as well as the best linear combination
of d given functions in least squares regression, and variants of this problem including
constraints on the parameters of the linear combination. When the input distribution
is known, there already exists an algorithm having an expected excess risk of order d/n, 
where n is the size of the training data. Without this strong assumption, standard results 
often contain a multiplicative log n factor, and require some additional assumptions like 
uniform boundedness of the d-dimensional input representation and exponential moments 
of the output. 

This work provides new risk bounds for the ridge estimator and the ordinary least 
squares estimator, and their variants. It also provides shrinkage procedures with conver- 
gence rate d/n (i.e., without the logarithmic factor) in expectation and in deviations, un- 
der various assumptions. The key common surprising factor of these results is the absence 
of exponential moment condition on the output distribution while achieving exponential 
deviations. All risk bounds are obtained through a PAC-Bayesian analysis on truncated 
differences of losses. Finally, we show that some of these results are not particular to the 
least squares loss, but can be generalized to similar strongly convex loss functions. 
Introduction

Our statistical task. Let $Z_1 = (X_1, Y_1), \dots, Z_n = (X_n, Y_n)$ be $n \ge 2$
pairs of input-output and assume that each pair has been independently drawn
from the same unknown distribution $P$. Let $\mathcal{X}$ denote the input space and let the
output space be the set of real numbers $\mathbb{R}$, so that $P$ is a probability distribution
on the product space $\mathcal{Z} = \mathcal{X} \times \mathbb{R}$. The target of learning algorithms is to predict
the output $Y$ associated with an input $X$ for pairs $Z = (X, Y)$ drawn from the
distribution $P$. The quality of a (prediction) function $f : \mathcal{X} \to \mathbb{R}$ is measured by
the least squares risk:

$$R(f) = \mathbb{E}_{Z \sim P}\bigl\{[Y - f(X)]^2\bigr\}.$$

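Throughout the paper, we assume that the output and all the prediction functions we
consider are square integrable. Let $\Theta$ be a closed convex set of $\mathbb{R}^d$, and
$\varphi_1, \dots, \varphi_d$ be $d$ prediction functions. Consider the regression model

$$\mathcal{F} = \Bigl\{ f_\theta = \sum_{j=1}^d \theta_j \varphi_j ;\ (\theta_1, \dots, \theta_d) \in \Theta \Bigr\}.$$

The best function $f^*$ in $\mathcal{F}$ is defined by

$$f^* = \sum_{j=1}^d \theta_j^* \varphi_j \in \operatorname{argmin}_{f \in \mathcal{F}} R(f).$$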
Such a function always exists but is not necessarily unique. Besides it is unknown 
since the probability generating the data is unknown. 

We will study the problem of predicting (at least) as well as function $f^*$. In
other words, we want to deduce from the observations $Z_1, \dots, Z_n$ a function $\hat f$
having with high probability a risk bounded by the minimal risk $R(f^*)$ on $\mathcal{F}$ plus a
small remainder term, which is typically of order $d/n$ up to a possible logarithmic
factor. Except in particular settings (e.g., $\Theta$ is a simplex and $d \ge \sqrt{n}$), it is known
that the convergence rate $d/n$ cannot be improved in a minimax sense (see GOll .
and [l2n for related results).

More formally, the target of the paper is to develop estimators $\hat f$ for which the
excess risk is controlled in deviations, i.e., such that for an appropriate constant
$\kappa > 0$, for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$,

$$R(\hat f) - R(f^*) \le \kappa \frac{d + \log(\varepsilon^{-1})}{n}. \tag{0.1}$$

Note that by integrating the deviations (using the identity
$\mathbb{E} W = \int_0^{+\infty} \mathbb{P}(W > t)\,dt$, which holds true for any nonnegative random
variable $W$), Inequality (0.1) implies

$$\mathbb{E} R(\hat f) - R(f^*) \le \kappa \frac{d + 1}{n}. \tag{0.2}$$

In this work, we do not assume that the function

$$f^{(\mathrm{reg})} : x \mapsto \mathbb{E}[Y \mid X = x],$$

which minimizes the risk $R$ among all possible measurable functions, belongs to
the model $\mathcal{F}$. So we might have $f^* \ne f^{(\mathrm{reg})}$ and in this case, bounds of the form

$$\mathbb{E} R(\hat f) - R(f^{(\mathrm{reg})}) \le C \bigl[R(f^*) - R(f^{(\mathrm{reg})})\bigr] + \kappa \frac{d}{n}, \tag{0.3}$$

with a constant $C$ larger than 1 do not even ensure that $\mathbb{E} R(\hat f)$ tends to $R(f^*)$
when $n$ goes to infinity. Bounds of this kind with $C > 1$ have been developed
to analyze nonparametric estimators using linear approximation spaces, in which
case the dimension $d$ is a function of $n$ chosen so that the bias term
$R(f^*) - R(f^{(\mathrm{reg})})$ has the order $d/n$ of the estimation term (see [11] and references within).
Here we intend to assess the generalization ability of the estimator even when the
model is misspecified (namely when $R(f^*) > R(f^{(\mathrm{reg})})$). Moreover we do not
assume either that $Y - f^{(\mathrm{reg})}(X)$ and $X$ are independent.

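Notation. When $\Theta = \mathbb{R}^d$, the function $f^*$ and the space $\mathcal{F}$ will be written
$f^*_{\mathrm{lin}}$ and $\mathcal{F}_{\mathrm{lin}}$ to emphasize that $\mathcal{F}$ is the whole linear space spanned by
$\varphi_1, \dots, \varphi_d$:

$$\mathcal{F}_{\mathrm{lin}} = \operatorname{span}\{\varphi_1, \dots, \varphi_d\} \quad\text{and}\quad f^*_{\mathrm{lin}} \in \operatorname{argmin}_{f \in \mathcal{F}_{\mathrm{lin}}} R(f).$$

The Euclidean norm will simply be written as $\|\cdot\|$, and $\langle \cdot, \cdot \rangle$ will be its associated
inner product. We will consider the vector valued function $\varphi : \mathcal{X} \to \mathbb{R}^d$ defined
by $\varphi(X) = [\varphi_k(X)]_{k=1}^d$, so that for any $\theta \in \Theta$, we have

$$f_\theta(x) = \langle \theta, \varphi(x) \rangle.$$

The Gram matrix is the $d \times d$ matrix $Q = \mathbb{E}[\varphi(X)\varphi(X)^T]$, and its smallest and
largest eigenvalues will respectively be written as $q_{\min}$ and $q_{\max}$. The empirical
risk of a function $f$ is

$$r(f) = \frac{1}{n} \sum_{i=1}^n [f(X_i) - Y_i]^2$$

and for $\lambda \ge 0$, the ridge regression estimator on $\mathcal{F}$ is defined by
$\hat f^{(\mathrm{ridge})} = f_{\hat\theta^{(\mathrm{ridge})}}$ with

$$\hat\theta^{(\mathrm{ridge})} \in \operatorname{argmin}_{\theta \in \Theta} \bigl\{ r(f_\theta) + \lambda \|\theta\|^2 \bigr\},$$

where $\lambda$ is some nonnegative real parameter. In the case when $\lambda = 0$, the ridge
regression $\hat f^{(\mathrm{ridge})}$ is nothing but the empirical risk minimizer $\hat f^{(\mathrm{erm})}$. In the same
way, we introduce the optimal ridge function optimizing the expected ridge risk:
$\tilde f = f_{\tilde\theta}$ with

$$\tilde\theta \in \operatorname{argmin}_{\theta \in \Theta} \bigl\{ R(f_\theta) + \lambda \|\theta\|^2 \bigr\}. \tag{0.4}$$

Finally, let $Q_\lambda = Q + \lambda I$ be the ridge regularization of $Q$, where $I$ is the identity
matrix.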
Why should we be interested in this task. There are three main reasons.
First we aim at a better understanding of the parametric linear least squares
method (classical textbooks can be misleading on this subject as we will point out
later), and intend to provide a non-asymptotic analysis of it.

Secondly, the task is central in nonparametric estimation for linear approximation
spaces (piecewise polynomials based on a regular partition, wavelet expansions,
trigonometric polynomials, ...).

Thirdly, it naturally arises in two-stage model selection. Precisely, when facing
the data, the statistician often has to choose several models which are likely to
be relevant for the task. These models can be of similar structures (like embedded
balls of functional spaces) or on the contrary of very different nature (e.g., based
on kernels, splines, wavelets or on parametric approaches). For each of these
models, we assume that we have a learning scheme which produces a 'good' prediction
function in the sense that it predicts as well as the best function of the
model up to some small additive term. Then the question is to decide on how
we use or combine/aggregate these schemes. One possible answer is to split the
data into two groups, use the first group to train the prediction function associated
with each model, and finally use the second group to build a prediction function
which is as good as (i) the best of the previously learnt prediction functions, (ii)
the best convex combination of these functions or (iii) the best linear combination
of these functions. This point of view has been introduced by Nemirovski in [fTTll
and optimal rates of aggregation are given in [20] and references within. This paper
focuses more on the linear aggregation task (even if (ii) enters in our setting),
assuming implicitly here that the models are given in advance and are beyond our
control and that the goal is to combine them appropriately.

Outline and contributions. The paper is organized as follows. Section 1
is a survey on risk bounds in linear least squares. Theorems 1.3 and 1.5 are the
results which come closer to our target. Section 2 provides a new analysis of
the ridge estimator and the ordinary least squares estimator, and their variants.
Theorem 2.1 provides an asymptotic result for the ridge estimator while Theorem
2.2 gives a non asymptotic risk bound of the empirical risk minimizer, which is
complementary to the theorems put in the survey section. In particular, the result
has the benefit to hold for the ordinary least squares estimator and for heavy-tailed
outputs. We show quantitatively that the ridge penalty leads to an implicit
reduction of the input space dimension. Section 3 shows a non asymptotic $d/n$
exponential deviation risk bound under weak moment conditions on the output $Y$
and on the $d$-dimensional input representation $\varphi(X)$. Section 4 presents stronger
results under boundedness assumption of $\varphi(X)$. However the latter results are
concerned with a not easily computable estimator. Section 5 gives risk bounds for
general loss functions from which the results of Section 4 are derived.

The main contribution of this paper is to show through a PAC-Bayesian analysis
on truncated differences of losses that the output distribution does not need
to have bounded conditional exponential moments in order for the excess risk of
appropriate estimators to concentrate exponentially. Our results tend to say that
truncation leads to more robust algorithms. Local robustness to contamination
is usually invoked to advocate the removal of outliers, claiming that estimators
should be made insensitive to small amounts of spurious data. Our work leads
to a different theoretical explanation. The observed points having unusually large
outputs when compared with the (empirical) variance should be down-weighted
in the estimation of the mean, since they contain less information than noise. In
short, huge outputs should be truncated because of their low signal to noise ratio.



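1. Variants of known results

1.1. Ordinary least squares and empirical risk minimization. The
ordinary least squares estimator is the most standard method in this case. It minimizes
the empirical risk

$$r(f) = \frac{1}{n} \sum_{i=1}^n [f(X_i) - Y_i]^2$$

among functions in $\mathcal{F}_{\mathrm{lin}}$ and produces

$$\hat f^{(\mathrm{ols})} = \sum_{j=1}^d \hat\theta_j^{(\mathrm{ols})} \varphi_j,$$

with $\hat\theta^{(\mathrm{ols})} = [\hat\theta_j^{(\mathrm{ols})}]_{j=1}^d$ a column vector satisfying

$$X^T X \hat\theta^{(\mathrm{ols})} = X^T Y, \tag{1.1}$$

where $Y = [Y_j]_{j=1}^n$ and $X = (\varphi_j(X_i))_{1 \le i \le n, 1 \le j \le d}$. It is well-known that

• the linear system (1.1) has at least one solution, and in fact, the set of solutions
is exactly $\{X^+ Y + u ;\ u \in \ker X\}$, where $X^+$ is the Moore-Penrose
pseudoinverse of $X$ and $\ker X$ is the kernel of the linear operator $X$;

• $X \hat\theta^{(\mathrm{ols})}$ is the (unique) orthogonal projection of the vector $Y \in \mathbb{R}^n$ on the
image of the linear map $X$;

• if $\sup_{x \in \mathcal{X}} \operatorname{Var}(Y \mid X = x) = \sigma^2 < +\infty$, we have (see [11, Theorem 11.1])
for any $X_1, \dots, X_n$ in $\mathcal{X}$,

$$\mathbb{E}\Bigl\{ \frac{1}{n} \sum_{i=1}^n \bigl[\hat f^{(\mathrm{ols})}(X_i) - f^{(\mathrm{reg})}(X_i)\bigr]^2 \Bigm| X_1, \dots, X_n \Bigr\} - \min_{f \in \mathcal{F}_{\mathrm{lin}}} \frac{1}{n} \sum_{i=1}^n \bigl[f(X_i) - f^{(\mathrm{reg})}(X_i)\bigr]^2 \le \sigma^2 \frac{\operatorname{rank}(X)}{n} \le \sigma^2 \frac{d}{n}, \tag{1.2}$$

where we recall that $f^{(\mathrm{reg})} : x \mapsto \mathbb{E}[Y \mid X = x]$ is the optimal regression
function, and that when this function belongs to $\mathcal{F}_{\mathrm{lin}}$ (i.e., $f^{(\mathrm{reg})} = f^*_{\mathrm{lin}}$), the
minimum term in (1.2) vanishes;

• from Pythagoras' theorem for the (semi)norm $W \mapsto \sqrt{\mathbb{E} W^2}$ on the space
of the square integrable random variables,

(1.3)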
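
The first two facts in the list above (solutions of the normal equations and the projection property) can be checked numerically on a deliberately rank-deficient design. This is a minimal sketch; the simulated design, the kernel vector `u` and the `rcond` cutoff are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.normal(size=(n, d))
X[:, 3] = X[:, 0] + X[:, 1]          # make the design rank deficient (rank 3)
Y = rng.normal(size=n)

# Minimum-norm solution X^+ Y; an explicit cutoff makes the rank decision robust.
theta_ols = np.linalg.pinv(X, rcond=1e-10) @ Y

# Normal equations: X^T X theta = X^T Y.
normal_eq_residual = X.T @ X @ theta_ols - X.T @ Y

# Any u in ker X gives another solution theta_ols + u.
u = np.array([1.0, 1.0, 0.0, -1.0])  # X @ u = col0 + col1 - col3 = 0
other_solution = theta_ols + u

# X theta is the orthogonal projection of Y on the image of X:
# the residual Y - X theta is orthogonal to every column of X.
fitted = X @ theta_ols
orthogonality = X.T @ (Y - fitted)
```

Both solutions give the same fitted vector, which is why the prediction function is well defined on the sample even when the coefficient vector is not unique.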

The analysis of the ordinary least squares often stops at this point in classical
statistical textbooks. (Besides, to simplify, the strong assumption
$f^{(\mathrm{reg})} = f^*_{\mathrm{lin}}$ is often made.) This can be misleading since Inequality (1.2) does not imply a
$d/n$ upper bound on the risk of $\hat f^{(\mathrm{ols})}$. Nevertheless the following result holds
[11, Theorem 11.3].

Theorem 1.1 If $\sup_{x \in \mathcal{X}} \operatorname{Var}(Y \mid X = x) = \sigma^2 < +\infty$ and

$$\|f^{(\mathrm{reg})}\|_\infty = \sup_{x \in \mathcal{X}} |f^{(\mathrm{reg})}(x)| \le H$$

for some $H > 0$, then the truncated estimator
$\hat f^{(\mathrm{ols})}_H = (\hat f^{(\mathrm{ols})} \wedge H) \vee -H$ satisfies

for some numerical constant $\kappa$.
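
The truncation $(\hat f \wedge H) \vee -H$ used in Theorem 1.1 is a plain clipping of predictions to $[-H, H]$. The sketch below, with made-up numbers, illustrates why clipping cannot hurt pointwise when the target is itself bounded by $H$.

```python
import numpy as np

def truncate(values, H):
    """Clip to [-H, H]: take the minimum with H, then the maximum with -H."""
    return np.maximum(np.minimum(values, H), -H)

H = 2.0
predictions = np.array([-5.0, -1.0, 0.3, 2.5, 10.0])   # hypothetical raw predictions
target = np.array([-2.0, -0.5, 0.0, 1.5, 2.0])         # values within [-H, H]
clipped = truncate(predictions, H)

# Squared errors before and after truncation, point by point.
errors_before = (predictions - target) ** 2
errors_after = (clipped - target) ** 2
```

Since every target value lies in $[-H, H]$, each clipped prediction is at least as close to its target as the raw prediction.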

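Using PAC-Bayesian inequalities, Catoni [8, Proposition 5.9.1] has proved a
different type of results on the generalization ability of $\hat f^{(\mathrm{ols})}$.

Theorem 1.2 Let $\mathcal{F}' \subset \mathcal{F}_{\mathrm{lin}}$ satisfying for some positive constants $a, M, M'$:

• there exists $f_0 \in \mathcal{F}'$ s.t. for any $x \in \mathcal{X}$,

$$\mathbb{E}\bigl\{ \exp\bigl[a |Y - f_0(X)|\bigr] \bigm| X = x \bigr\} \le M.$$

• for any $f_1, f_2 \in \mathcal{F}'$, $\sup_{x \in \mathcal{X}} |f_1(x) - f_2(x)| \le M'$.

Let $Q = \mathbb{E}[\varphi(X)\varphi(X)^T]$ and $\hat Q = \frac{1}{n} \sum_{i=1}^n \varphi(X_i)\varphi(X_i)^T$ be respectively the
expected and empirical Gram matrices. If $\det Q \ne 0$, then there exist positive
constants $C_1$ and $C_2$ (depending only on $a$, $M$ and $M'$) such that with probability
at least $1 - \varepsilon$, as soon as

$$\Bigl\{ f \in \mathcal{F}_{\mathrm{lin}} : r(f) \le r(\hat f^{(\mathrm{ols})}) + C_1 \frac{d}{n} \Bigr\} \subset \mathcal{F}', \tag{1.5}$$

we have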
This result can be understood as follows. Let us assume we have some prior
knowledge suggesting that $f^*_{\mathrm{lin}}$ belongs to the interior of a set $\mathcal{F}' \subset \mathcal{F}_{\mathrm{lin}}$ (e.g.,
a bound on the coefficients of the expansion of $f^*_{\mathrm{lin}}$ as a linear combination of
$\varphi_1, \dots, \varphi_d$). It is likely that (1.5) holds, and it is indeed proved in Catoni [8,
Section 5.11] that the probability that it does not hold goes to zero exponentially
fast with $n$ in the case when $\mathcal{F}'$ is a Euclidean ball. If it is the case, then we know
that the excess risk is of order $d/n$ up to the unpleasant ratio of determinants,
which, fortunately, almost surely tends to 1 as $n$ goes to infinity.

By using localized PAC-Bayes inequalities introduced in Catoni (71 HI, one can
derive from Inequality (6.9) and Lemma 4.1 of Alquier [1] the following result.

Theorem 1.3 Let $q_{\min}$ be the smallest eigenvalue of the Gram matrix
$Q = \mathbb{E}[\varphi(X)\varphi(X)^T]$. Assume that there exist a function $f_0 \in \mathcal{F}_{\mathrm{lin}}$ and positive
constants $H$ and $C$ such that

$$\|f^*_{\mathrm{lin}} - f_0\|_\infty \le H$$

and $|Y| \le C$ almost surely.

Then for an appropriate randomized estimator requiring the knowledge of $f_0$,
$H$ and $C$, for any $\varepsilon > 0$ with probability at least $1 - \varepsilon$ w.r.t. the distribution
generating the observations $Z_1, \dots, Z_n$ and the randomized prediction function
$\hat f$, we have

$$R(\hat f) - R(f^*_{\mathrm{lin}}) \le \kappa (H^2 + C^2) \frac{d \log(3 q_{\min}^{-1}) + \log\bigl((\log n)\,\varepsilon^{-1}\bigr)}{n}$$

for some $\kappa$ not depending on $d$ and $n$.

Using the result of [8, Section 5.11], one can prove that Alquier's result still
holds for $\hat f = \hat f^{(\mathrm{ols})}$, but with $\kappa$ also depending on the determinant of the product
matrix $Q$. The $\log[\log(n)]$ factor is unimportant and could be removed in
the special case quoted here (it comes from a union bound on a grid of possible
temperature parameters, whereas the temperature could be set here to a
fixed value). The result differs from Theorem 1.2 essentially by the fact that
the ratio of the determinants of the empirical and expected product matrices has
been replaced by the inverse of the smallest eigenvalue of the quadratic form
$\theta \mapsto R\bigl(\sum_{j=1}^d \theta_j \varphi_j\bigr) - R(f^*_{\mathrm{lin}})$. In the case when the expected Gram matrix is
known (e.g., in the case of a fixed design, and also in the slightly different context
of transductive inference), this smallest eigenvalue can be set to one by choosing
the quadratic form $\theta \mapsto R(f_\theta) - R(f^*_{\mathrm{lin}})$ to define the Euclidean metric on the
parameter space.

Localized Rademacher complexities [3, 4] allow to prove the following property
of the empirical risk minimizer.

Theorem 1.4 Assume that the input representation $\varphi(X)$, the set of parameters
and the output $Y$ are almost surely bounded, i.e., for some positive constants $H$
and $C$,

$$\sup_{\theta \in \Theta} \|\theta\| \le 1, \qquad \operatorname{ess\,sup} \|\varphi(X)\| \le H,$$

and

$$|Y| \le C \quad \text{a.s.}$$

Let $\nu_1 \ge \dots \ge \nu_d$ be the eigenvalues of the Gram matrix $Q = \mathbb{E}[\varphi(X)\varphi(X)^T]$.
The empirical risk minimizer satisfies for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$:



< k{H + C) 



n 

rank(Q) + \og{e-^) 



n 

where $\kappa$ is a numerical constant.

Proof. The result is a modified version of Theorem 6.7 in [4] applied to the linear
kernel $k(u, v) = \langle u, v \rangle / (H + C)^2$. Its proof follows the same lines as in Theorem
6.7 mutatis mutandis: Corollary 5.3 and Lemma 6.5 should be used as intermediate
steps instead of Theorem 5.4 and Lemma 6.6, the nonzero eigenvalues of the
integral operator induced by the kernel being the nonzero eigenvalues of $Q$. □

When we know that the target function $f^*_{\mathrm{lin}}$ is inside some ball, it is natural
to consider the empirical risk minimizer on this ball. This allows to compare
Theorem 1.4 to excess risk bounds with respect to $f^*_{\mathrm{lin}}$.

Finally, from the work of Birgé and Massart [5], we may derive the following
risk bound for the empirical risk minimizer on a ball (see Appendix B).

Theorem 1.5 Assume that $\mathcal{F}$ has a diameter $H$ for $L^\infty$-norm, i.e., for any $f_1, f_2$
in $\mathcal{F}$, $\sup_{x \in \mathcal{X}} |f_1(x) - f_2(x)| \le H$ and there exists a function $f_0 \in \mathcal{F}$ satisfying
the exponential moment condition:

$$\text{for any } x \in \mathcal{X}, \quad \mathbb{E}\bigl\{ \exp\bigl[A^{-1} |Y - f_0(X)|\bigr] \bigm| X = x \bigr\} \le M, \tag{1.7}$$

for some positive constants $A$ and $M$. Let

where the infimum is taken with respect to all possible orthonormal bases of $\mathcal{F}$ for
the dot product $\langle f_1, f_2 \rangle = \mathbb{E} f_1(X) f_2(X)$ (when the set $\mathcal{F}$ admits no basis with
exactly $d$ functions, we set $\tilde B = +\infty$). Then the empirical risk minimizer satisfies
for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$:

$$R(\hat f^{(\mathrm{erm})}) - R(f^*) \le \kappa (A^2 + H^2) \frac{d \log\bigl(2 + (\tilde B / n) \wedge (n / d)\bigr) + \log(\varepsilon^{-1})}{n},$$

where $\kappa$ is a positive constant depending only on $M$.

This result comes closer to what we are looking for: it gives exponential deviation
inequalities of order at worst $d \log(n/d)/n$. It shows that, even if the Gram
matrix $Q$ has a very small eigenvalue, there is an algorithm satisfying a convergence
rate of order $d \log(n/d)/n$. In this respect, this result is stronger than
Theorem 1.3. However there are cases in which the smallest eigenvalue of $Q$ is
of order 1, while $\tilde B$ is large (i.e., $\tilde B \gg n$). In these cases, Theorem 1.3 does not
contain the logarithmic factor which appears in Theorem 1.5.

1.2. Projection estimator. When the input distribution is known, an alternative
to the ordinary least squares estimator is the following projection estimator.
One first finds an orthonormal basis of $\mathcal{F}_{\mathrm{lin}}$ for the dot product
$\langle f_1, f_2 \rangle = \mathbb{E} f_1(X) f_2(X)$, and then uses the projection estimator on this basis. Specifically,
if $\phi_1, \dots, \phi_d$ form an orthonormal basis of $\mathcal{F}_{\mathrm{lin}}$, then the projection estimator on
this basis is:

$$\hat f^{(\mathrm{proj})} = \sum_{j=1}^d \hat\theta_j^{(\mathrm{proj})} \phi_j,$$

with

$$\hat\theta_j^{(\mathrm{proj})} = \frac{1}{n} \sum_{i=1}^n Y_i \phi_j(X_i).$$

Theorem 4 in [20] gives a simple bound of order $d/n$ on the expected excess risk
$\mathbb{E} R(\hat f^{(\mathrm{proj})}) - R(f^*_{\mathrm{lin}})$.
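
To make the projection estimator concrete, here is a minimal sketch in a setting where the input distribution is known: $X$ is assumed uniform on $[0, 1]$ and the basis is taken to be the first three Fourier functions, which are orthonormal for that distribution. The basis size, coefficients, noise and sample size are illustrative assumptions.

```python
import numpy as np

def fourier_basis(x):
    """Functions orthonormal in L2 of the uniform distribution on [0, 1]."""
    return np.column_stack([
        np.ones_like(x),
        np.sqrt(2) * np.cos(2 * np.pi * x),
        np.sqrt(2) * np.sin(2 * np.pi * x),
    ])

def projection_estimator(x, y):
    """theta_j = (1/n) * sum_i Y_i * phi_j(X_i); no matrix inversion is needed."""
    return fourier_basis(x).T @ y / len(y)

rng = np.random.default_rng(2)
n = 20000
x = rng.uniform(size=n)
theta_true = np.array([0.5, 1.0, -2.0])
y = fourier_basis(x) @ theta_true + rng.normal(size=n)

theta_proj = projection_estimator(x, y)
```

The estimator only averages products $Y_i \phi_j(X_i)$, so its validity rests entirely on the basis being orthonormal under the true input distribution.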

1.3. Penalized least squares estimator. It is well established that parameters
of the ordinary least squares estimator are numerically unstable, and that
the phenomenon can be corrected by adding an $L^2$ penalty ( [[TSlfTSl '). This solution
has been labeled ridge regression in statistics diTll ). and consists in replacing
$\hat f^{(\mathrm{ols})}$ by $\hat f^{(\mathrm{ridge})} = f_{\hat\theta^{(\mathrm{ridge})}}$ with

$$\hat\theta^{(\mathrm{ridge})} \in \operatorname{argmin}_{\theta \in \mathbb{R}^d} \Bigl\{ r(f_\theta) + \lambda \sum_{j=1}^d \theta_j^2 \Bigr\},$$

where $\lambda$ is a positive parameter. The typical value of $\lambda$ should be small to avoid
excessive shrinkage of the coefficients, but not too small in order to make the
optimization task numerically more stable.
Risk bounds for this estimator can be derived from general results concerning
penalized least squares on reproducing kernel Hilbert spaces (B), but as it is 
shown in Appendix O this ends up with complicated results having the desired 
d/n rate only under strong assumptions. 

Another popular regularizer is the $L^1$ norm. This procedure is known as Lasso
[19] and is defined by

$$\hat\theta^{(\mathrm{lasso})} \in \operatorname{argmin}_{\theta \in \mathbb{R}^d} \Bigl\{ r(f_\theta) + \lambda \sum_{j=1}^d |\theta_j| \Bigr\}.$$

As the $L^2$ penalty, the $L^1$ penalty shrinks the coefficients. The difference is that
for coefficients which tend to be close to zero, the shrinkage makes them equal to
zero. This allows to select relevant variables (i.e., find the $j$'s such that $\theta_j^* \ne 0$).
If we assume that the regression function $f^{(\mathrm{reg})}$ is a linear combination of only
$d^* \ll d$ variables/functions $\varphi_j$, the typical result is to prove that the risk of the
Lasso estimator for $\lambda$ of order $\sqrt{(\log d)/n}$ is of order $(d^* \log d)/n$. Since this
quantity is much smaller than $d/n$, this makes a huge improvement (provided
that the sparsity assumption is true). This kind of results usually requires strong
conditions on the eigenvalues of submatrices of $Q$, essentially assuming that the
functions $\varphi_j$ are near orthogonal. We do not know to which extent these conditions
are required. However, if we do not consider the specific algorithm of Lasso, but
the model selection approach developed in [1], one can change these conditions
into a single condition concerning only the minimal eigenvalue of the submatrix of
$Q$ corresponding to relevant variables. In fact, we will see that even this condition
can be removed.

1.4. Conclusion of the survey. Previous results clearly leave room for
improvements. The projection estimator requires the unrealistic assumption that the
input distribution is known, and the result holds only in expectation. Results using
$L^1$ or $L^2$ regularizations require strong assumptions, in particular on the eigenvalues
of (submatrices of) $Q$. Theorem 1.1 provides a $(d \log n)/n$ convergence rate
only when $R(f^*_{\mathrm{lin}}) - R(f^{(\mathrm{reg})})$ is at most of order $(d \log n)/n$. Theorem 1.2
gives a different type of guarantee: the $d/n$ rate is indeed achieved, but the random
ratio of determinants appearing in the bound may raise some eyebrows and forbid
an explicit computation of the bound and comparison with other bounds. Theorem
1.3 seems to indicate that the rate of convergence will be degraded when the Gram
matrix $Q$ is unknown and ill-conditioned. Theorem 1.4 does not put any assumption
on $Q$ to reach the $d/n$ rate, but requires particular boundedness constraints
on the parameter set, the input vector $\varphi(X)$ and the output. Finally, Theorem
1.5 comes closer to what we are looking for. Yet there is still an unwanted logarithmic
factor, and the result holds only when the output has uniformly bounded
conditional exponential moments, which as we will show is not necessary.



2. Ridge regression and empirical risk minimization

We recall the definition

$$\mathcal{F} = \Bigl\{ f_\theta = \sum_{j=1}^d \theta_j \varphi_j ;\ (\theta_1, \dots, \theta_d) \in \Theta \Bigr\},$$

where $\Theta$ is a closed convex set, not necessarily bounded (so that $\Theta = \mathbb{R}^d$ is
allowed). In this section, we provide exponential deviation inequalities for the
empirical risk minimizer and the ridge regression estimator on $\mathcal{F}$ under weak conditions
on the tail of the output distribution.

The most general theorem which can be obtained from the route followed in
this section is Theorem 6.5, stated along with the proof. It is expressed
in terms of a series of empirical bounds. The first deduction we can make from
this technical result is of asymptotic nature. It is stated under weak hypotheses,
taking advantage of the weak law of large numbers.

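Theorem 2.1 For $\lambda \ge 0$, let $\tilde f$ be its associated optimal ridge function (see
(0.4)). Let us assume that

$$\mathbb{E}\bigl[\|\varphi(X)\|^4\bigr] < +\infty, \tag{2.1}$$

$$\text{and} \quad \mathbb{E}\bigl\{ \|\varphi(X)\|^2 [\tilde f(X) - Y]^2 \bigr\} < +\infty. \tag{2.2}$$

Let $\nu_1, \dots, \nu_d$ be the eigenvalues of the Gram matrix $Q = \mathbb{E}[\varphi(X)\varphi(X)^T]$, and
let $Q_\lambda = Q + \lambda I$ be the ridge regularization of $Q$. Let us define the effective ridge
dimension

$$D = \sum_{i=1}^d \frac{\nu_i}{\nu_i + \lambda} \mathbf{1}(\nu_i > 0) = \operatorname{Tr}\bigl[(Q + \lambda I)^{-1} Q\bigr] = \mathbb{E}\bigl[\|Q_\lambda^{-1/2} \varphi(X)\|^2\bigr].$$

When $\lambda = 0$, $D$ is equal to the rank of $Q$ and is otherwise smaller. For any $\varepsilon > 0$,
there is $n_\varepsilon$ such that for any $n \ge n_\varepsilon$, with probability at least $1 - \varepsilon$,

$$R(\hat f^{(\mathrm{ridge})}) + \lambda \|\hat\theta^{(\mathrm{ridge})}\|^2 \le \min_{\theta \in \Theta} \bigl\{ R(f_\theta) + \lambda \|\theta\|^2 \bigr\} + \frac{30\, \mathbb{E}\bigl\{ \|Q_\lambda^{-1/2} \varphi(X)\|^2 [\tilde f(X) - Y]^2 \bigr\}}{\mathbb{E}\bigl\{ \|Q_\lambda^{-1/2} \varphi(X)\|^2 \bigr\}} \frac{D}{n} + 1000 \sup_{v \in \mathbb{R}^d} \frac{\mathbb{E}\bigl[\langle v, \varphi(X) \rangle^2 [\tilde f(X) - Y]^2\bigr]}{\mathbb{E}\bigl(\langle v, \varphi(X) \rangle^2\bigr) + \lambda \|v\|^2} \frac{\log(3 \varepsilon^{-1})}{n}$$

$$\le \min_{\theta \in \Theta} \bigl\{ R(f_\theta) + \lambda \|\theta\|^2 \bigr\} + \operatorname{ess\,sup} \mathbb{E}\bigl\{ [Y - \tilde f(X)]^2 \bigm| X \bigr\} \frac{30 D + 1000 \log(3 \varepsilon^{-1})}{n}.$$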

Proof. See Section[0(pagel40l). □ 

This theorem shows that the ordinary least squares estimator (obtained when
$\Theta = \mathbb{R}^d$ and $\lambda = 0$), as well as the empirical risk minimizer on any closed
convex set, asymptotically reaches a $d/n$ speed of convergence under very weak
hypotheses. It shows also the regularization effect of the ridge regression. There
emerges an effective dimension $D$, where the ridge penalty has a threshold effect
on the eigenvalues of the Gram matrix.
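
The threshold effect of the ridge penalty can be seen on a small example: eigenvalues much larger than $\lambda$ each contribute almost 1 to the effective dimension, while eigenvalues much smaller than $\lambda$ contribute almost nothing. The sketch below assumes the definition of the effective ridge dimension as the sum of $\nu_i/(\nu_i + \lambda)$ over the positive eigenvalues of $Q$; the diagonal matrix and the value of $\lambda$ are illustrative.

```python
import numpy as np

def effective_ridge_dimension(Q, lam):
    """D = sum_i nu_i / (nu_i + lam) over the positive eigenvalues nu_i of Q."""
    nu = np.linalg.eigvalsh(Q)
    nu = nu[nu > 1e-12 * max(nu.max(), 1.0)]     # keep eigenvalues treated as positive
    return float(np.sum(nu / (nu + lam)))

Q = np.diag([4.0, 1.0, 0.01, 0.0])               # rank 3 Gram matrix
D0 = effective_ridge_dimension(Q, 0.0)           # equals rank(Q)
D1 = effective_ridge_dimension(Q, 0.1)           # strictly smaller once lam > 0

# Equivalent trace form Tr[(Q + lam I)^{-1} Q], valid for lam > 0.
trace_form = np.trace(np.linalg.solve(Q + 0.1 * np.eye(4), Q))
```

Here the eigenvalue 0.01 contributes about 0.09 to `D1`, while the eigenvalue 4 contributes about 0.98.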

On the other hand, the weakness of this result is its asymptotic nature: $n_\varepsilon$
may be arbitrarily large under such weak hypotheses, and this shows even in the
simplest case of the estimation of the mean of a real valued random variable by its
empirical mean (which is the case when $d = 1$ and $\varphi(X) \equiv 1$).

Let us now give some non asymptotic rate under stronger hypotheses and for
the empirical risk minimizer (i.e., $\lambda = 0$).

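Theorem 2.2 Let $d' = \operatorname{rank}(Q)$. Assume that

$$\mathbb{E}\bigl\{ [Y - f^*(X)]^4 \bigr\} < +\infty$$

and

$$B = \sup_{f \in \operatorname{span}\{\varphi_1, \dots, \varphi_d\} - \{0\}} \frac{\|f\|_\infty^2}{\mathbb{E}[f(X)^2]} < +\infty.$$

Consider the (unique) empirical risk minimizer
$\hat f^{(\mathrm{erm})} = f_{\hat\theta^{(\mathrm{erm})}} : x \mapsto \langle \hat\theta^{(\mathrm{erm})}, \varphi(x) \rangle$
on $\mathcal{F}$ for which $\hat\theta^{(\mathrm{erm})} \in \operatorname{span}\{\varphi(X_1), \dots, \varphi(X_n)\}$ (see footnote 4). For any values of $\varepsilon$ and $n$ such
that $2/n \le \varepsilon \le 1$ and

$$n > 1280 B^2 \Bigl[ 3 B d' + \log(2/\varepsilon) + \frac{16 B^2 d'^2}{n} \Bigr],$$

with probability at least $1 - \varepsilon$,

$$R(\hat f^{(\mathrm{erm})}) - R(f^*) \le 1920\, B \sqrt{\mathbb{E}[Y - f^*(X)]^4} \Bigl[ \frac{3 B d' + \log(2 \varepsilon^{-1})}{n} + \Bigl( \frac{4 B d'}{n} \Bigr)^2 \Bigr].$$

Footnote 4: When $\mathcal{F} = \mathcal{F}_{\mathrm{lin}}$, we have $\hat\theta^{(\mathrm{erm})} = X^+ Y$, with
$X = (\varphi_j(X_i))_{1 \le i \le n, 1 \le j \le d}$, $Y = [Y_j]_{j=1}^n$, and
$X^+$ is the Moore-Penrose pseudoinverse of $X$.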
Proof. See Section [Q (page l40l) . □

It is quite surprising that the traditional assumption of uniform boundedness
of the conditional exponential moments of the output can be replaced by a simple
moment condition for reasonable confidence levels (i.e., $\varepsilon \ge 2/n$). For highest
confidence levels, things are more tricky since we need to control with high
probability a term of order $[r(f^*) - R(f^*)] d/n$ (see Theorem 6.6). The cost to
pay to get the exponential deviations under only a fourth-order moment condition
on the output is the appearance of the geometrical quantity $B$ as a multiplicative
factor, as opposed to Theorems 1.3 and 1.5. More precisely, from [5, Inequality
(3.2)], we have $B \le \tilde B \le B d$, but the quantity $\tilde B$ appears inside a logarithm in
Theorem 1.5. However, Theorem 1.5 is restricted to the empirical risk minimizer
on a ball, while the result here is valid for any closed convex set $\Theta$, and in
particular applies to the ordinary least squares estimator.

Theorem 2.2 is still limited in at least three ways: it applies only to uniformly
bounded $\varphi(X)$, the output needs to have a fourth moment, and the confidence
level should be as great as $\varepsilon \ge 2/n$. These limitations will be addressed in the
next sections by considering more involved algorithms.



3. A min-max estimator for robust estimation

3.1. The min-max estimator and its theoretical guarantee. This
section provides an alternative to the empirical risk minimizer with non asymptotic
exponential risk deviations of order $d/n$ for any confidence level. Moreover,
we will assume only a second order moment condition on the output and cover
the case of unbounded inputs, the requirement on $\varphi(X)$ being only a finite fourth
order moment. On the other hand, we assume that the set $\Theta$ of the vectors of
coefficients is bounded. The computability of the proposed estimator and numerical
experiments are discussed at the end of the section.

Let $\alpha > 0$, $\lambda \ge 0$, and consider the truncation function:

For any $\theta, \theta' \in \Theta$, introduce

We recall $\tilde f = f_{\tilde\theta}$ with $\tilde\theta \in \operatorname{argmin}_{\theta \in \Theta} \bigl\{ R(f_\theta) + \lambda \|\theta\|^2 \bigr\}$ and the effective ridge
dimension

$$D = \sum_{i=1}^d \frac{\nu_i}{\nu_i + \lambda} \mathbf{1}(\nu_i > 0).$$

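Let us assume in this section that for any $j \in \{1, \dots, d\}$,

$$\mathbb{E}\bigl\{ \varphi_j(X)^2 [Y - \tilde f(X)]^2 \bigr\} < +\infty, \tag{3.1}$$

and

$$\mathbb{E}\bigl[\varphi_j^4(X)\bigr] < +\infty. \tag{3.2}$$

Define

$$\mathcal{S} = \bigl\{ f \in \mathcal{F}_{\mathrm{lin}} : \mathbb{E}[f(X)^2] = 1 \bigr\}, \tag{3.3}$$

$$\sigma = \sqrt{\mathbb{E}\bigl\{ [Y - \tilde f(X)]^2 \bigr\}} = \sqrt{R(\tilde f)}, \tag{3.4}$$

$$\chi = \max_{f \in \mathcal{S}} \sqrt{\mathbb{E}[f(X)^4]}, \tag{3.5}$$

$$\kappa = \frac{\sqrt{\mathbb{E}\bigl\{ [\varphi(X)^T Q_\lambda^{-1} \varphi(X)]^2 \bigr\}}}{\mathbb{E}\bigl[\varphi(X)^T Q_\lambda^{-1} \varphi(X)\bigr]}, \tag{3.6}$$

$$\kappa' = \frac{\sqrt{\mathbb{E}\bigl\{ [Y - \tilde f(X)]^4 \bigr\}}}{\mathbb{E}\bigl\{ [Y - \tilde f(X)]^2 \bigr\}} = \frac{\sqrt{\mathbb{E}\bigl\{ [Y - \tilde f(X)]^4 \bigr\}}}{\sigma^2}, \tag{3.7}$$

$$T = \max_{\theta, \theta' \in \Theta} \sqrt{\lambda \|\theta - \theta'\|^2 + \mathbb{E}\bigl[f_\theta(X) - f_{\theta'}(X)\bigr]^2}. \tag{3.8}$$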



Theorem 3.1 Let us assume that (3.1) and (3.2) hold. For some numerical constants
$c$ and $c'$, for

$$n > c \kappa \chi D,$$

by taking

1 f CKxD\ 



a= . ^ -2 1-— ^ , (3.9) 



n 



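for any estimator $f_{\hat\theta}$ satisfying $\hat\theta \in \Theta$ a.s., for any $\varepsilon > 0$ and any $\lambda \ge 0$, with
probability at least $1 - \varepsilon$, we have

$$R(f_{\hat\theta}) + \lambda \|\hat\theta\|^2 \le \min_{\theta \in \Theta} \bigl\{ R(f_\theta) + \lambda \|\theta\|^2 \bigr\} + \frac{1}{n \alpha} \Bigl( \max_{\theta_1 \in \Theta} \mathcal{D}(\hat\theta, \theta_1) - \inf_{\theta \in \Theta} \max_{\theta_1 \in \Theta} \mathcal{D}(\theta, \theta_1) \Bigr)$$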



7i \ — ^'•^XD 






Proof. See Section [Q (page 1501) . □ 
By choosing an estimator such that 

max Die, 9i) < inf max DiO, Oi) + a^ — 



Theorem 3.1 provides a non asymptotic bound for the excess (ridge) risk with a
$D/n$ convergence rate and an exponential tail even when neither the output $Y$ nor
the input vector $\varphi(X)$ has exponential moments. This stronger non asymptotic
bound compared to the bounds of the previous section comes at the price of replacing
the empirical risk minimizer by a more involved estimator. Section 3.3
provides a way of computing it approximately.



3.2. The value of the uncentered kurtosis coefficient $\chi$. Let us
discuss here the value of the constant $\chi$, which plays a critical role in the speed of
convergence of our bound. With the convention $0/0 = 0$, we have

Let us first examine the case when $\varphi_1(X) \equiv 1$ and $[\varphi_j(X), j = 2, \dots, d]$ are
independent. To compute $\chi$, we can assume without loss of generality that they
are centered and of unit variance, which will be the case after $Q^{-1/2}$ is applied to
them. In this situation, introducing



X* = max 



1/2 



we see that for any $u \in \mathbb{R}^d$ with $\|u\| = 1$, we have



i=l 



E{{u,^iX))')=J2^tnV^iXr) + 6 uy^H^,iXf]E[^,iXf] 

l<i<j<d 

d 



i=2 



< 5Z + 6 5^ uy^ + ^xT Xli^i^^ 



< sup {xl-S)J2< + 3ij2^^] +Axl'\iY. 

"6Rl,||ii|| = l 



i=l 



i=l 



i=2 






Thus in this case 



X< 



< 



1/2 



33/2 



3+ 



X^>3, 
1 < < 3. 



3 + ^xP + ^ 



1/2 



X* > V^: 

1<X*<V3. 



If moreover the random variables $\varphi_j(X)$ are not skewed, in the sense that
$\mathbb{E}[\varphi_j(X)^3] = 0$, $j = 2, \dots, d$, then



X = X* 



X< 3 + 



1/2 



X* > V^, 
1 < X* < \/3. 



In particular in the case when $\varphi_j(X)$ are Gaussian variables, $\chi = \chi^* = \sqrt{3}$ (as
could be seen in a more straightforward way, since in this case $\langle u, \varphi(X) \rangle$ is also
Gaussian!).

In particular, this situation arises in compressed sensing using random projections
on Gaussian vectors. Specifically, assume that we want to recover a signal
$f \in \mathbb{R}^M$ that we know to be well approximated by a linear combination of $d$
basis vectors $f_1, \dots, f_d$. We measure $n \ll M$ projections of the signal $f$ on
i.i.d. $M$-dimensional standard normal random vectors $X_1, \dots, X_n$: $Y_i = \langle f, X_i \rangle$,
$i = 1, \dots, n$. Then, recovering the coefficients $\theta_1, \dots, \theta_d$ such that $f = \sum_{j=1}^d \theta_j f_j$
is associated to the least squares regression problem $Y \approx \sum_{j=1}^d \theta_j \varphi_j(X)$, with
$\varphi_j(x) = \langle f_j, x \rangle$, and $X$ having an $M$-dimensional standard normal distribution.

Let us discuss now a bound which is suited to the case when we are using a
partial basis of regression functions. The functions $\varphi_j$ are usually bounded (think
of the Fourier basis, wavelet bases, histograms, splines ...).

Let us assume that for some positive constant $A$ and any $u \in \mathbb{R}^d$,

$$\|u\| \le A\, \mathbb{E}\bigl[\langle u, \varphi(X) \rangle^2\bigr]^{1/2}.$$

This appears as some stability property of the partial basis $\varphi_j$ with respect to the
$L_2$-norm, since it can also be written as

$$\sum_{j=1}^d u_j^2 \le A^2\, \mathbb{E}\Bigl[ \Bigl( \sum_{j=1}^d u_j \varphi_j(X) \Bigr)^2 \Bigr], \qquad u \in \mathbb{R}^d.$$



This will be the case if $\varphi_j$ is nearly orthogonal in the sense that

l-A^ 



W.[^p^{Xf]>l, and ^[^^{X)^k{X)\ 



< 



d 
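In this situation, by using

$$\mathbb{E}\bigl[\langle u, \varphi(X) \rangle^4\bigr] \le \|u\|^2 \operatorname{ess\,sup} \|\varphi(X)\|^2\, \mathbb{E}\bigl[\langle u, \varphi(X) \rangle^2\bigr],$$

one can check that

$$\chi \le A \operatorname{ess\,sup} \|\varphi(X)\|.$$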

Therefore, if $X$ is the uniform random variable on the unit interval and $\varphi_j$, $j =
1, \dots, d$ are any functions from the Fourier basis (meaning that they are of the
form $\sqrt{2} \cos(2 k \pi X)$ or $\sqrt{2} \sin(2 k \pi X)$), then $\chi \le \sqrt{2d}$ (because they form an
orthogonal system, so that $A = 1$).

On the other hand, a localized basis like the evenly spaced histogram basis of
the unit interval

$$\varphi_j(x) = \sqrt{d}\, \mathbf{1}\bigl( x \in [(j-1)/d, j/d[ \bigr), \qquad j = 1, \dots, d,$$

will also be such that $\chi \le \sqrt{d}$. Similar computations could be made for other
local bases, like wavelet bases. Note that when $\chi$ is of order $\sqrt{d}$, Theorem 3.1
means that the excess risk of the min-max truncated estimator $\hat f$ is upper bounded
by $C d/n$ provided that $n \ge C d \sqrt{d}$ for a large enough constant $C$.
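
For the histogram basis the ratio $\sqrt{\mathbb{E}[f(X)^4]} / \mathbb{E}[f(X)^2]$ can be computed in closed form, which gives a quick numerical check of the $\sqrt{d}$ value. The sketch assumes $X$ uniform on the unit interval and $f = \sum_j u_j \varphi_j$ with $\varphi_j$ equal to $\sqrt{d}$ times the indicator of the $j$-th of $d$ equal-width bins; the function name and the choice $d = 16$ are illustrative.

```python
import numpy as np

def kurtosis_ratio_histogram(u):
    """sqrt(E[f(X)^4]) / E[f(X)^2] for f = sum_j u_j * phi_j, X uniform on [0, 1],
    with phi_j = sqrt(d) * indicator of the j-th of d equal-width bins."""
    u = np.asarray(u, dtype=float)
    d = len(u)
    second_moment = np.sum(u ** 2)        # E[f^2] = sum_j (d * u_j^2) * (1/d)
    fourth_moment = d * np.sum(u ** 4)    # E[f^4] = sum_j (d^2 * u_j^4) * (1/d)
    return np.sqrt(fourth_moment) / second_moment

d = 16
rng = np.random.default_rng(3)
random_ratios = [kurtosis_ratio_histogram(rng.normal(size=d)) for _ in range(1000)]
worst_case = kurtosis_ratio_histogram(np.eye(d)[0])   # f = phi_1, a single bin
flat_case = kurtosis_ratio_histogram(np.ones(d))      # f constant on [0, 1]
```

The ratio is largest for a function concentrated on a single bin, where it equals $\sqrt{d}$, and smallest for a constant function, where it equals 1.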

Let us discuss the case when $X$ is some observed random variable whose
distribution is only approximately known. Namely let us assume that $(\varphi_j)_{j=1}^d$ is
some basis of functions in $L_2[\tilde P]$ with some known coefficient $\tilde\chi$, where $\tilde P$ is an
approximation of the true distribution of $X$ in the sense that the density of the true
distribution $P$ of $X$ with respect to the distribution $\tilde P$ is in the range $(\eta^{-1}, \eta)$. In
this situation, the coefficient $\chi$ satisfies the inequality $\chi \le \eta^{3/2} \tilde\chi$. Indeed

Let us conclude this section with some scenario for the case when $X$ is a
real-valued random variable. Let us consider the distribution function of $\tilde P$

$$\tilde F(x) = \tilde P(X \le x).$$

Then, if $\tilde P$ has no atoms, the distribution of $\tilde F(X)$ (under $\tilde P$) is uniform in $(0, 1)$. Starting
from some suitable partial basis $(\varphi_j)_{j=1}^{d}$ of $L_2[(0, 1), U]$ where $U$ is the uniform
distribution, like the ones discussed above, we can build a basis for our problem
as

$$\tilde\varphi_j(X) = \varphi_j\bigl[\tilde F(X)\bigr].$$

Moreover, if $P$ is absolutely continuous with respect to $\tilde P$ with density $g$, then
$P \circ \tilde F^{-1}$ is absolutely continuous with respect to $\tilde P \circ \tilde F^{-1}$, with density $g \circ \tilde F^{-1}$, and
of course, the fact that $g$ takes values in $(\eta^{-1}, \eta)$ implies the same property for
$g \circ \tilde F^{-1}$. Thus, if $\tilde\chi$ is the coefficient corresponding to $\varphi_j(U)$ when $U$ is the uniform
random variable on the unit interval, then the true coefficient $\chi$ (corresponding to
$\tilde\varphi_j(X)$) will be such that $\chi \le \eta^{3/2} \tilde\chi$.

3.3. Computation of the estimator. For ease of description of the algorithm,
we will write $X$ for $\varphi(X)$, which is equivalent to considering without loss
of generality that the input space is $\mathbb{R}^d$ and that the functions $\varphi_1, \dots, \varphi_d$ are the
coordinate functions. Therefore, the function $f_\theta$ maps an input $x$ to $\langle \theta, x \rangle$.
Let us introduce

$$L_i(\theta) = \alpha\bigl(\langle \theta, X_i \rangle - Y_i\bigr)^2.$$

For any subset of indices $I \subset \{1, \dots, n\}$, let us define

We suggest the following heuristics to compute an approximation of

$$\arg\min_{\theta} \sup_{\theta'} \mathcal{D}(\theta, \theta').$$

• Start from $I_1 = \{1, \dots, n\}$ with the empirical risk minimizer

^1 = argmin = ^(«""). 

• At step number k, compute 




• Consider the sets 



JkM = 




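where $Q_k^{-1}$ is the (pseudo-)inverse of the matrix $Q_k$.
• Let us define

$$\begin{aligned}
\theta_{k,1}(\eta) &= \arg\min r_{I_{k,1}(\eta)},\\
I_{k,2}(\eta) &= \bigl\{ i \in I_k : |L_i(\theta_{k,1}(\eta)) - L_i(\hat\theta_k)| \le 1 \bigr\},\\
\theta_{k,2}(\eta) &= \arg\min r_{I_{k,2}(\eta)},\\
(\eta_k, \ell_k) &= \arg\min_{\eta \in \mathbb{R}_+,\ \ell \in \{1,2\}}\ \max_{j=1,\dots,k} \mathcal{D}\bigl(\theta_{k,\ell}(\eta), \hat\theta_j\bigr),\\
I_{k+1} &= I_{k,\ell_k}(\eta_k),\\
\hat\theta_{k+1} &= \theta_{k,\ell_k}(\eta_k).
\end{aligned}$$

• Stop when

$$\max_{j=1,\dots,k} \mathcal{D}(\hat\theta_{k+1}, \hat\theta_j) \ge 0,$$

and set $\hat\theta = \hat\theta_k$ as the final estimator of $\tilde\theta$.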

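Note that there will be at most $n$ steps, since $I_{k+1} \subsetneq I_k$, and in practice far fewer
in this iterative scheme. Let us give some justification for this proposal. Let us
notice first that

$$\mathcal{D}(\theta + h, \theta) = n\alpha\lambda\bigl(\|\theta + h\|^2 - \|\theta\|^2\bigr) + \sum_{i=1}^{n} \psi\Bigl(\alpha\bigl[2\langle h, X_i\rangle\bigl(\langle \theta, X_i\rangle - Y_i\bigr) + \langle h, X_i\rangle^2\bigr]\Bigr).$$

Hopefully, $\tilde\theta = \arg\min_{\theta \in \mathbb{R}^d} \bigl(R(f_\theta) + \lambda\|\theta\|^2\bigr)$ is in some small neighbourhood of
$\hat\theta_k$ already, according to the distance defined by $Q \simeq Q_k$. So we may try to look
for improvements of $\hat\theta_k$ by exploring neighbourhoods of $\hat\theta_k$ of increasing sizes
with respect to some approximation of the relevant norm $\|\theta\|_Q^2 = \mathbb{E}[\langle \theta, X\rangle^2]$.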

Since the truncation function $\psi$ is constant on $(-\infty, -1]$ and $[1, +\infty)$, the
map $\theta \mapsto \mathcal{D}(\theta, \hat\theta_k)$ induces a decomposition of the parameter space into cells
corresponding to different sets $I$ of examples. Indeed, such a set $I$ is associated
to the set $C_I$ of $\theta$ such that $|L_i(\theta) - L_i(\hat\theta_k)| < 1$ if and only if $i \in I$. Although
this may not be the case, we will do as if the map $\theta \mapsto \mathcal{D}(\theta, \hat\theta_k)$ restricted to the
cell $C_I$ reached its minimum at some interior point of $C_I$, and approximate this
minimizer by the minimizer of $r_I$.

The idea is to remove first the examples which will become inactive in the
closest cells to the current estimate $\hat\theta_k$. The cells for which the contribution of
example number $i$ is constant are delimited by at most four parallel hyperplanes.

It is easy to see that the square of the inverse of the distance of $\hat\theta_k$ to the closest
of these hyperplanes is equal to

Indeed, this distance is the infimum of $\|Q_k^{1/2} h\|$, where $h$ is a solution of

^ a 

It is computed by considering $h$ of the form $h = \xi\, \|Q_k^{-1/2} X_i\|^{-1} Q_k^{-1} X_i$ and solving
an equation of order two in $\xi$.

This explains the proposed choice of $I_{k,1}(\eta)$. Then a first estimate $\theta_{k,1}(\eta)$ is
computed on the basis of this reduced sample, and the sample is readjusted to
$I_{k,2}(\eta)$ by checking which constraints are really activated in the computation of
$\mathcal{D}(\theta_{k,1}(\eta), \hat\theta_k)$. The estimated parameter is then readjusted taking into account
the readjusted sample (this could as a variant be iterated more than once). Now
that we have some new candidates $\theta_{k,\ell}(\eta)$, we check the minimax property against
them to elect $I_{k+1}$ and $\hat\theta_{k+1}$. Since we did not check the minimax property against
the whole parameter set $\Theta = \mathbb{R}^d$, we have no theoretical guarantee for this simplified
algorithm. Nonetheless, similar computations to what we did could prove that
we are close to solving $\min_{j=1,\dots,k} R(f_{\hat\theta_j})$, since we checked the minimax property
on the reduced parameter set $\{\hat\theta_j,\ j = 1, \dots, k\}$. Thus the proposed heuristic is
capable of improving on the performance of the ordinary least squares estimator,
while being guaranteed not to degrade its performance significantly.

3.4. Synthetic experiments. In Section 3.4.1, we detail the three kinds of
noises we work with. Then, Sections 3.4.2, 3.4.3 and 3.4.4 describe the three types
of functional relationships between the input, the output and the noise involved in
our experiments. A motivation for choosing these input-output distributions was
the ability to compute exactly the excess risk, and thus to compare estimators easily.
Section 3.4.5 provides details about the implementation, its computational
efficiency and the main conclusions of the numerical experiments. Figures and
tables are postponed to Appendix E.

3.4.1. Noise distributions. In our experiments, we consider three types of noise
that are centered and with unit variance:

• the standard Gaussian noise: $W \sim \mathcal{N}(0, 1)$,

• a heavy-tailed noise defined by $W = \operatorname{sign}(V)/|V|^{1/q}$, with $V \sim \mathcal{N}(0, 1)$ a
standard Gaussian random variable and $q = 2.01$ (the real number $q$ is taken
strictly larger than 2 as for $q = 2$, the random variable $W$ would not admit
a finite second moment),

• a mixture of a Dirac random variable with a low-variance Gaussian random
variable defined by: with probability $p$, $W = \sqrt{(1 - \rho)/p}$, and with
probability $1 - p$, $W$ is drawn from

$$\mathcal{N}\Bigl(-\frac{\sqrt{p(1 - \rho)}}{1 - p},\ \frac{\rho}{1 - p} - \frac{p(1 - \rho)}{(1 - p)^2}\Bigr).$$

The parameter $\rho \in [p, 1]$ characterizes the part of the variance of $W$ explained
by the Gaussian part of the mixture. Note that this noise admits
exponential moments, but for $n$ of order $1/p$, the Dirac part of the mixture
generates low signal to noise points.
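The three noise laws above can be simulated with the standard library alone. The following is a minimal sketch, not the authors' implementation: the function names are illustrative, and the sign of the Gaussian mean in the mixture is taken so that $W$ is centered, as the text requires.

```python
import math
import random


def gaussian_noise(rng):
    """Standard Gaussian noise W ~ N(0, 1)."""
    return rng.gauss(0.0, 1.0)


def heavy_tailed_noise(rng, q=2.01):
    """Heavy-tailed noise W = sign(V) / |V|**(1/q) with V ~ N(0, 1)."""
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # probability-zero event, avoids a division by zero
        v = rng.gauss(0.0, 1.0)
    return math.copysign(1.0, v) / abs(v) ** (1.0 / q)


def mixture_params(p, rho):
    """Dirac location, Gaussian mean and Gaussian variance of the mixture."""
    if not (0.0 < p < 1.0 and p <= rho <= 1.0):
        raise ValueError("need 0 < p < 1 and p <= rho <= 1")
    dirac = math.sqrt((1.0 - rho) / p)
    mean = -math.sqrt(p * (1.0 - rho)) / (1.0 - p)
    var = max(rho / (1.0 - p) - p * (1.0 - rho) / (1.0 - p) ** 2, 0.0)
    return dirac, mean, var


def mixture_noise(rng, p, rho):
    """Dirac mass with probability p, low-variance Gaussian otherwise."""
    dirac, mean, var = mixture_params(p, rho)
    if rng.random() < p:
        return dirac
    return rng.gauss(mean, math.sqrt(var))


def mixture_moments(p, rho):
    """Exact first and second moments of the mixture noise."""
    dirac, mean, var = mixture_params(p, rho)
    first = p * dirac + (1.0 - p) * mean
    second = p * dirac ** 2 + (1.0 - p) * (var + mean ** 2)
    return first, second
```

Calling `mixture_moments(p, rho)` gives a quick check that a chosen pair $(p, \rho)$ yields a centered, unit-variance noise before running a simulation.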

3.4.2. Independent normalized covariates (INC(n, d)). In INC(n, d), the input-output
pair is such that

$$Y = \langle \theta^*, X \rangle + \sigma W,$$

where the components of $X$ are independent standard normal distributions, $\theta^* =
(10, \dots, 10)^T \in \mathbb{R}^d$, and $\sigma = 10$.

3.4.3. Highly correlated covariates (HCC(n, d)). In HCC(n, d), the input-output
pair is such that

$$Y = \langle \theta^*, X \rangle + \sigma W,$$

where $X$ is a multivariate centered normal Gaussian with covariance matrix $Q$
obtained by drawing a $(d, d)$-matrix $A$ of uniform random variables in $[0, 1]$ and
by computing $Q = AA^T$, $\theta^* = (10, \dots, 10)^T \in \mathbb{R}^d$ and $\sigma = 10$. So the only difference
with the setting of Section 3.4.2 is the correlation between the covariates.
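A possible way to draw samples from the INC and HCC designs is sketched below, using only the standard library. The function names and the `noise` callback are assumptions made for illustration. For HCC the sketch uses the fact that $X = Ag$ with $g$ standard normal has covariance $AA^T$, which avoids an explicit matrix factorization.

```python
import random


def draw_inc(n, d, noise, rng, sigma=10.0, coef=10.0):
    """Draw n pairs (x, y) with independent N(0, 1) covariates.

    `noise` is any callable taking the generator and returning one draw of W.
    """
    sample = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = coef * sum(x) + sigma * noise(rng)  # theta* = (coef, ..., coef)
        sample.append((x, y))
    return sample


def draw_hcc(n, d, noise, rng, sigma=10.0, coef=10.0):
    """Draw n pairs (x, y) with covariance Q = A A^T, A uniform on [0, 1]."""
    a = [[rng.random() for _ in range(d)] for _ in range(d)]
    sample = []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # x = A g is centered Gaussian with covariance A A^T
        x = [sum(a[r][c] * g[c] for c in range(d)) for r in range(d)]
        y = coef * sum(x) + sigma * noise(rng)
        sample.append((x, y))
    return sample, a
```

Passing `noise=lambda rng: rng.gauss(0.0, 1.0)` reproduces the Gaussian-noise case; any other sampler for $W$ can be plugged in the same way.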

3.4.4. Trigonometric series (TS(n, d)). Let $X$ be a uniform random variable on
$[0, 1]$. Let $d$ be an even number. Let

$$\varphi(X) = \bigl(\cos(2\pi X), \dots, \cos(d\pi X), \sin(2\pi X), \dots, \sin(d\pi X)\bigr)^T.$$

In TS(n, d), the input-output pair is such that

$$Y = 20X^2 - 10X - \frac{5}{3} + \sigma W,$$

with $\sigma = 10$. One can check that this implies

$$\theta^* = \Bigl(\frac{20}{\pi^2}, \dots, \frac{20}{\pi^2 (d/2)^2}, -\frac{10}{\pi}, \dots, -\frac{10}{\pi (d/2)}\Bigr)^T.$$
3.4.5. Experiments. 
Choice of the parameters and implementation details. Our min-max truncated
algorithm has two parameters $\alpha$ and $\lambda$. In the subsequent experiments, we
set the ridge parameter $\lambda$ to the natural default choice for it: $\lambda = 0$. For the truncation
parameter $\alpha$, according to our analysis (see (3.9)), it roughly should be of
order $1/\sigma^2$ up to kurtosis coefficients. By using the ordinary least squares estimator,
we roughly estimate this value, and test values of $\alpha$ in a geometric grid (of 8
points) around it (with ratio 3). Cross-validation can be used to select the final $\alpha$.
Nevertheless, it is computationally expensive and is significantly outperformed in
our experiments by the following simple procedure: start with the smallest $\alpha$ in
the geometric grid and increase it as long as $\hat\theta = \hat\theta_1$, that is as long as we stop at
the end of the first iteration and output the empirical risk minimizer.
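The selection rule for $\alpha$ can be sketched as follows. Two points are assumptions of this sketch rather than statements of the text: the grid is centered in log scale on the rough estimate (the text only says "around it"), and the returned value is the first $\alpha$ at which the estimator stops coinciding with the empirical risk minimizer (or the largest grid value if that never happens). The predicate `equals_erm` is a hypothetical callback standing for one run of the min-max truncated procedure.

```python
def geometric_grid(center, ratio=3.0, size=8):
    """Geometric grid of `size` points, symmetric around `center` in log scale."""
    offset = (size - 1) / 2.0
    return [center * ratio ** (i - offset) for i in range(size)]


def select_alpha(alphas, equals_erm):
    """Walk the grid upwards while the estimator equals the ERM.

    `equals_erm(alpha)` must return True when the procedure run with this
    alpha stops after its first iteration and outputs the ERM.
    """
    ordered = sorted(alphas)
    for alpha in ordered:
        if not equals_erm(alpha):
            return alpha
    return ordered[-1]
```

With this design each grid value costs one run of the estimator, so at most 8 runs are needed, which is much cheaper than cross-validating over the same grid.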

To compute $\theta_{k,1}(\eta)$ or $\theta_{k,2}(\eta)$, one needs to determine a least squares estimate
(for a modified sample). To reduce the computational burden, we do not want to
test all possible values of $\eta$ (note that there are at most $n$ values leading to different
estimates). Our experiments show that testing only three levels of $\eta$ is sufficient.
Precisely, we sort the quantity




by decreasing order and consider rj being the first, 5-th and 25-th value of the 
ordered list. Overall, in our experiments, the computational complexity is approx- 
imately fifty times larger than the one of computing the ordinary least squares 
estimator. 

Results. The tables and figures have been gathered in Appendix E. Tables 1 and
2 give the results for the mixture noise. Tables 3, 4 and 5 provide the results for
the heavy-tailed noise and the standard Gaussian noise. Each line of the tables has
been obtained after 1000 generations of the training set. These results show that
the min-max truncated estimator $\hat f$ is often equal to the ordinary least squares
estimator $\hat f^{(\mathrm{ols})}$, while it ensures impressive consistent improvements
when it differs from $\hat f^{(\mathrm{ols})}$. In this latter case, the
number of points that are not considered in $\hat f$, i.e. the number of points with low
signal to noise ratio, varies a lot from 1 to 150 and is often of order 30. Note that
not only the points that we expect to be considered as outliers (i.e. very large output
points) are erased, and that these points seem to be taken out by local groups:
see Figures 1 and 2 in which the erased points are marked by surrounding circles.

Besides, the heavier the noise tail is (and also the larger the variance of the
noise is), the more often the truncation modifies the initial ordinary least squares
estimator, and the more improvements we get from the min-max truncated estimator,
which also becomes much more robust than the ordinary least squares
estimator (see the confidence intervals in the tables).
4. A SIMPLE TIGHT RISK BOUND FOR A SOPHISTICATED PAC-BAYES ALGORITHM

A disadvantage of the min-max estimator proposed in the previous section is
that its theoretical guarantee depends on kurtosis-like coefficients. In this section,
we provide a more sophisticated estimator, having a simple theoretical excess risk
bound, which is independent of these kurtosis-like quantities when we assume
$L^\infty$-boundedness of the set $\mathcal{F}$.

We consider that the set $\mathcal{F}$ is bounded so that we can define the "prior" distribution
$\pi$ as the uniform distribution on $\mathcal{F}$ (i.e., the one induced by the Lebesgue
distribution on $\Theta \subset \mathbb{R}^d$ renormalized to get $\pi(\mathcal{F}) = 1$). Let $\lambda > 0$ and

$$W_i(f, f') = \lambda\bigl\{ [Y_i - f(X_i)]^2 - [Y_i - f'(X_i)]^2 \bigr\}.$$

Introduce

$$\hat{\mathcal{E}}(f) = \log \int \frac{\pi(df')}{\prod_{i=1}^{n} \bigl[1 - W_i(f, f') + \tfrac{1}{2} W_i(f, f')^2\bigr]}. \tag{4.1}$$

We consider the "posterior" distribution $\hat\pi$ on the set $\mathcal{F}$ with density:

$$\frac{d\hat\pi}{d\pi}(f) = \frac{\exp[-\hat{\mathcal{E}}(f)]}{\int \exp[-\hat{\mathcal{E}}(f')]\, \pi(df')}. \tag{4.2}$$

To understand intuitively why this distribution concentrates on functions with low
risk, one should think that when $\lambda$ is small enough, $1 - W_i(f, f') + \tfrac{1}{2} W_i(f, f')^2$
is close to $e^{-W_i(f, f')}$, and consequently

$$\hat{\mathcal{E}}(f) \approx \lambda \sum_{i=1}^{n} [Y_i - f(X_i)]^2 + \log \int \pi(df') \exp\Bigl\{-\lambda \sum_{i=1}^{n} [Y_i - f'(X_i)]^2\Bigr\},$$

and

$$\frac{d\hat\pi}{d\pi}(f) \approx \frac{\exp\bigl\{-\lambda \sum_{i=1}^{n} [Y_i - f(X_i)]^2\bigr\}}{\int \exp\bigl\{-\lambda \sum_{i=1}^{n} [Y_i - f'(X_i)]^2\bigr\}\, \pi(df')}.$$

The following theorem gives a $d/n$ convergence rate for the randomized algorithm
which draws the prediction function from $\mathcal{F}$ according to the distribution $\hat\pi$.
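To make the construction concrete, here is a minimal sketch of the posterior on a finite set of candidates, under simplifying assumptions that are not in the text: the model is one-dimensional ($f_\theta(x) = \theta x$), the prior is uniform on a finite grid of $\theta$ values instead of the Lebesgue-induced prior on $\Theta$, and all names are illustrative. The quantity computed for each candidate is $\log \int \pi(df') \prod_i [1 - W_i + W_i^2/2]^{-1}$, evaluated through the soft truncation $T(w) = -\log(1 - w + w^2/2)$.

```python
import math


def soft_truncation(w):
    """T(w) = -log(1 - w + w^2/2); the argument of the log is always >= 1/2."""
    return -math.log(1.0 - w + 0.5 * w * w)


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def posterior_weights(thetas, xs, ys, lam):
    """Posterior weights on a finite grid, with a uniform prior on the grid.

    Returns (weights, e_hat) where e_hat[k] plays the role of the empirical
    criterion of candidate k and weights[k] is proportional to exp(-e_hat[k]).
    """
    losses = [[(y - t * x) ** 2 for x, y in zip(xs, ys)] for t in thetas]
    e_hat = []
    for loss_f in losses:
        totals = [
            sum(soft_truncation(lam * (a - b)) for a, b in zip(loss_f, loss_g))
            for loss_g in losses
        ]
        e_hat.append(log_mean_exp(totals))
    low = min(e_hat)
    unnorm = [math.exp(-(e - low)) for e in e_hat]
    total = sum(unnorm)
    return [u / total for u in unnorm], e_hat
```

Drawing the randomized estimator then amounts to sampling an index according to `weights`, for example with `random.choices(thetas, weights=weights)`.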

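Theorem 4.1 Assume that $\mathcal{F}$ has a diameter $H$ for $L^\infty$-norm:

$$\sup_{f_1, f_2 \in \mathcal{F},\ x \in \mathcal{X}} |f_1(x) - f_2(x)| = H \tag{4.3}$$

and that, for some $\sigma > 0$,

$$\sup_{x \in \mathcal{X}} \mathbb{E}\bigl\{[Y - f^*(X)]^2 \bigm| X = x\bigr\} \le \sigma^2 < +\infty. \tag{4.4}$$

Let $\hat f$ be a prediction function drawn from the distribution $\hat\pi$ defined in (4.2, page
25) and depending on the parameter $\lambda > 0$. Then for any $0 < \eta' < 1 - \lambda(2\sigma + H)^2$
and $\varepsilon > 0$, with probability (with respect to the distribution $P^{\otimes n}\hat\pi$ generating the
observations $Z_1, \dots, Z_n$ and the randomized prediction function $\hat f$) at least $1 - \varepsilon$,
we have

$$R(\hat f) - R(f^*) \le (2\sigma + H)^2\, \frac{C_1 d + C_2 \log(2\varepsilon^{-1})}{n}$$

with

$$C_1 = \frac{\log\Bigl(\frac{(1 + \eta)^2}{\eta'(1 - \eta)}\Bigr)}{\eta(1 - \eta - \eta')}, \qquad C_2 = \frac{2}{\eta(1 - \eta - \eta')} \qquad\text{and}\qquad \eta = \lambda(2\sigma + H)^2.$$

In particular for $\lambda = 0.32(2\sigma + H)^{-2}$ and $\eta' = 0.18$, we get

$$R(\hat f) - R(f^*) \le (2\sigma + H)^2\, \frac{16.6\, d + 12.5 \log(2\varepsilon^{-1})}{n}.$$

Besides if $f^* \in \arg\min_{f \in \mathcal{F}_{\mathrm{lin}}} R(f)$, then with probability at least $1 - \varepsilon$, we have

$$R(\hat f) - R(f^*) \le (2\sigma + H)^2\, \frac{8.3\, d + 12.5 \log(2\varepsilon^{-1})}{n}.$$

Proof. This is a direct consequence of Theorem 5.5 (page 33), Lemma 5.3
(page 31) and Lemma 5.6 (page 35). □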
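The numerical constants 16.6 and 12.5 can be reproduced with a few lines of code. The sketch below assumes the form $C_1 = \log\bigl((1+\eta)^2 / (\eta'(1-\eta))\bigr) / (\eta(1-\eta-\eta'))$ and $C_2 = 2/(\eta(1-\eta-\eta'))$ with $\eta = \lambda(2\sigma+H)^2$, which is the form obtained by following the proof of Theorem 5.5 with $G = 1$ and $D = d$; the function names are illustrative.

```python
import math


def theorem_constants(eta, eta_prime):
    """Constants C1 and C2 for given eta and eta' (assumed form, see lead-in)."""
    if not (0.0 < eta < 1.0 and 0.0 < eta_prime < 1.0 - eta):
        raise ValueError("need 0 < eta < 1 and 0 < eta' < 1 - eta")
    denom = eta * (1.0 - eta - eta_prime)
    c1 = math.log((1.0 + eta) ** 2 / (eta_prime * (1.0 - eta))) / denom
    c2 = 2.0 / denom
    return c1, c2


def excess_risk_bound(n, d, sigma, diameter, lam, eta_prime, eps):
    """Right-hand side of the deviation bound for the randomized estimator."""
    scale = (2.0 * sigma + diameter) ** 2
    c1, c2 = theorem_constants(lam * scale, eta_prime)
    return scale * (c1 * d + c2 * math.log(2.0 / eps)) / n
```

With `eta = 0.32` and `eta_prime = 0.18`, `theorem_constants` returns values that round to the 16.6 and 12.5 quoted in the theorem, which supports the assumed form of $C_1$.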

If we know that $f^*_{\mathrm{lin}}$ belongs to some bounded ball in $\mathcal{F}_{\mathrm{lin}}$, then one can define a
bounded $\mathcal{F}$ as this ball, use the previous theorem and obtain an excess risk bound
with respect to $f^*_{\mathrm{lin}}$.

Remark 4.1 Let us discuss this result. On the positive side, we have a $d/n$ convergence
rate in expectation and in deviations. It has no extra logarithmic factor.
It does not require any particular assumption on the smallest eigenvalue of the
covariance matrix. To achieve exponential deviations, a uniformly bounded second
moment of the output knowing the input is surprisingly sufficient: we do not
require the traditional exponential moment condition on the output. Appendix A
(page 64) argues that the uniformly bounded conditional second moment assumption
cannot be replaced with just a bounded second moment condition.

On the negative side, the estimator is rather complicated. When the target is
to predict as well as the best linear combination $f^*_{\mathrm{lin}}$ up to a small additive term,
it requires the knowledge of an $L^\infty$-bounded ball in which $f^*_{\mathrm{lin}}$ lies and an upper
bound on $\sup_{x \in \mathcal{X}} \mathbb{E}\{[Y - f^*_{\mathrm{lin}}(X)]^2 \mid X = x\}$. The looser this knowledge is, the
bigger the constant in front of $d/n$ is.

Finally, we propose a randomized algorithm consisting in drawing the prediction
function according to $\hat\pi$. As usual, by convexity of the loss function,
the risk of the deterministic estimator $\hat f_{\mathrm{determ}} = \int f\, \hat\pi(df)$ satisfies $R(\hat f_{\mathrm{determ}}) \le
\int R(f)\, \hat\pi(df)$, so that, after some pretty standard computations, one can prove that
for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$:



n 

for some appropriate numerical constant k > 0. 

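Remark 4.2 The previous result was expressing boundedness in terms of the
$L^\infty$ diameter of the set of functions $\mathcal{F}$. By using Lemma 5.7 (page 35) instead of
Lemma 5.6 (page 35), Theorem 4.1 still holds without assuming (4.3) and (4.4),
but by replacing $(2\sigma + H)^2$ by

$$V = \Biggl[\, 2\sqrt{\sup_{f \in \mathcal{F}_{\mathrm{lin}}:\ \mathbb{E}[f(X)^2] = 1} \mathbb{E}\bigl(f^2(X)[Y - f^*(X)]^2\bigr)} + \sqrt{\sup_{f', f'' \in \mathcal{F}} \mathbb{E}\bigl([f'(X) - f''(X)]^2\bigr)}\ \sqrt{\sup_{f \in \mathcal{F}_{\mathrm{lin}}:\ \mathbb{E}[f(X)^2] = 1} \mathbb{E}\bigl[f^4(X)\bigr]}\, \Biggr]^2.$$

The quantity $V$ is finite when simultaneously, $\Theta$ is bounded, and for any $j$ in
$\{1, \dots, d\}$, the quantities $\mathbb{E}[\varphi_j^4(X)]$ and $\mathbb{E}\{\varphi_j(X)^2 [Y - f^*(X)]^2\}$ are finite.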



5. A GENERIC LOCALIZED PAC-BAYES APPROACH 



5.1. Notation and setting. In this section, we drop the restrictions of the
linear least squares setting considered in the other sections in order to focus on the
ideas underlying the estimator and the results presented in Section 4. To do this,
we consider that the loss incurred by predicting $y'$ while the correct output is $y$ is
$\ell(y, y')$ (and is not necessarily equal to $(y - y')^2$). The quality of a (prediction)
function $f : \mathcal{X} \to \mathbb{R}$ is measured by its risk

$$R(f) = \mathbb{E}\bigl\{\ell[Y, f(X)]\bigr\}.$$

We still consider the problem of predicting (at least) as well as the best function in
a given set of functions $\mathcal{F}$ (but $\mathcal{F}$ is not necessarily a subset of a finite dimensional
linear space). Let $f^*$ still denote a function minimizing the risk among functions
in $\mathcal{F}$: $f^* \in \arg\min_{f \in \mathcal{F}} R(f)$. For simplicity, we assume that it exists. The excess
risk is defined by

$$\bar R(f) = R(f) - R(f^*).$$

Let $\ell : \mathcal{Z} \times \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ be a function such that $\ell(Z, f, f')$ represents¹ how much
worse $f$ predicts than $f'$ on the data $Z$. Let us introduce the real-valued random
processes $L : (f, f') \mapsto \ell(Z, f, f')$ and $L_i : (f, f') \mapsto \ell(Z_i, f, f')$, where
$Z, Z_1, \dots, Z_n$ denote i.i.d. random variables with distribution $P$.

Let $\pi$ and $\pi^*$ be two (prior) probability distributions on $\mathcal{F}$. We assume the
following integrability condition.

Condition I. For any $f \in \mathcal{F}$, we have

$$\int \mathbb{E}\bigl\{\exp[L(f, f')]\bigr\}^n\, \pi^*(df') < +\infty, \tag{5.1}$$

$$\text{and}\quad \int \frac{\pi(df)}{\int \mathbb{E}\bigl\{\exp[L(f, f')]\bigr\}^n\, \pi^*(df')} < +\infty. \tag{5.2}$$

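We consider the real-valued processes

$$\hat L(f, f') = \sum_{i=1}^{n} L_i(f, f'), \tag{5.3}$$

$$\hat{\mathcal{E}}(f) = \log \int \exp[\hat L(f, f')]\, \pi^*(df'), \tag{5.4}$$

$$L^\flat(f, f') = -n \log\bigl\{\mathbb{E}\bigl[\exp[-L(f, f')]\bigr]\bigr\}, \tag{5.5}$$

$$L^\sharp(f, f') = n \log\bigl\{\mathbb{E}\bigl[\exp[L(f, f')]\bigr]\bigr\}, \tag{5.6}$$

$$\text{and}\quad \mathcal{E}^\sharp(f) = \log\Bigl\{\int \exp[L^\sharp(f, f')]\, \pi^*(df')\Bigr\}. \tag{5.7}$$

Essentially, the quantities $\hat L(f, f')$, $L^\flat(f, f')$ and $L^\sharp(f, f')$ represent how much worse
the prediction from $f$ is than that from $f'$ with respect to the training data or in expectation.
By Jensen's inequality, we have

$$L^\flat \le n\mathbb{E}(L) = \mathbb{E}(\hat L) \le L^\sharp. \tag{5.8}$$

The quantities $\hat{\mathcal{E}}(f)$ and $\mathcal{E}^\sharp(f)$ should be understood as some kind of (empirical
or expected) excess risk of the prediction function $f$ with respect to an implicit
reference induced by the integral over $\mathcal{F}$.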

For a distribution $\rho$ on $\mathcal{F}$ absolutely continuous w.r.t. $\pi$, let $\frac{d\rho}{d\pi}$ denote the
density of $\rho$ w.r.t. $\pi$. For any real-valued (measurable) function $h$ defined on $\mathcal{F}$
such that $\int \exp[h(f)]\, \pi(df) < +\infty$, we define the distribution $\pi_h$ on $\mathcal{F}$ by its
density:

$$\frac{d\pi_h}{d\pi}(f) = \frac{\exp[h(f)]}{\int \exp[h(f')]\, \pi(df')}. \tag{5.9}$$

We will use the posterior distribution:

$$\frac{d\hat\pi}{d\pi}(f) = \frac{d\pi_{-\hat{\mathcal{E}}}}{d\pi}(f) = \frac{\exp[-\hat{\mathcal{E}}(f)]}{\int \exp[-\hat{\mathcal{E}}(f')]\, \pi(df')}. \tag{5.10}$$

Finally, for any $\beta \ge 0$, we will use the following measures of the size (or complexity)
of $\mathcal{F}$ around the target function:

$$\mathcal{I}^*(\beta) = -\log\Bigl\{\int \exp[-\beta \bar R(f)]\, \pi^*(df)\Bigr\}$$

and

$$\mathcal{I}(\beta) = -\log\Bigl\{\int \exp[-\beta \bar R(f)]\, \pi(df)\Bigr\}.$$

¹While the natural choice in the least squares setting is $\ell((X, Y), f, f') = [Y - f(X)]^2 -
[Y - f'(X)]^2$, we will see that for heavy-tailed outputs, it is preferable to consider the following
soft-truncated version of it, up to a scaling factor $\lambda > 0$: $\ell((X, Y), f, f') = T\bigl(\lambda[(Y - f(X))^2 -
(Y - f'(X))^2]\bigr)$, with $T(x) = -\log(1 - x + x^2/2)$. Equality (5.4, page 28) corresponds to (4.1,
page 25) with this choice of function $\ell$ and for the choice $\pi^* = \pi$.

5.2. The localized PAC-Bayes bound. With the notation introduced in
the previous section, we have the following risk bound for any randomized estimator.

Theorem 5.1 Assume that $\pi$, $\pi^*$, $\mathcal{F}$ and $\ell$ satisfy the integrability conditions
(5.1) and (5.2, page 28). Let $\rho$ be a (posterior) probability distribution on $\mathcal{F}$ admitting
a density with respect to $\pi$ depending on $Z_1, \dots, Z_n$. Let $\hat f$ be a prediction
function drawn from the distribution $\rho$. Then for any $\gamma \ge 0$, $\gamma^* \ge 0$ and $\varepsilon > 0$,
with probability (with respect to the distribution $P^{\otimes n}\rho$ generating the observations
$Z_1, \dots, Z_n$ and the randomized prediction function $\hat f$) at least $1 - \varepsilon$:

$$\int \bigl[L^\flat(\hat f, f) + \gamma^* \bar R(f)\bigr]\, \pi^*_{-\gamma^* \bar R}(df) - \gamma \bar R(\hat f) \le \mathcal{I}^*(\gamma^*) - \mathcal{I}(\gamma) - \log\Bigl\{\int \exp[-\mathcal{E}^\sharp(f)]\, \pi(df)\Bigr\} + \log\Bigl[\frac{d\rho}{d\hat\pi}(\hat f)\Bigr] + 2\log(2\varepsilon^{-1}). \tag{5.11}$$

Proof. See Section 6.4 (page 57). □

Some extra work will be needed to prove that Inequality (5.11) provides an
upper bound on the excess risk $\bar R(\hat f)$ of the estimator $\hat f$. As we will see in the next
sections, despite the $-\gamma \bar R(\hat f)$ term and provided that $\gamma$ is sufficiently small, the
left-hand side will be essentially lower bounded by $\lambda \bar R(\hat f)$ with $\lambda > 0$, while, by
choosing $\rho = \hat\pi$, the estimator does not appear in the right-hand side.
5.3. Application under an exponential moment condition. The estimator
proposed in Section 4 and Theorem 5.1 seems rather unnatural (or at least
complicated) at first sight. The goal of this section is twofold. First it shows that
under exponential moment conditions (i.e., stronger assumptions than the ones in
Theorem 4.1 when the linear least squares setting is considered), one can have a
much simpler estimator than the one consisting in drawing a function according to
the distribution (4.2) with $\hat{\mathcal{E}}$ given by (4.1) and yet still obtain a $d/n$ convergence
rate. Secondly it illustrates Theorem 5.1 in a different and simpler way than the
one we will use to prove Theorem 4.1.

In this section, we consider the following variance and complexity assumptions.

Condition V1. There exist $\lambda > 0$ and $0 \le \eta < 1$ such that for any function
$f \in \mathcal{F}$, we have $\mathbb{E}\bigl\{\exp\{\lambda \ell[Y, f(X)]\}\bigr\} < +\infty$,

$$\log\Bigl\{\mathbb{E}\Bigl\{\exp\bigl\{\lambda\bigl[\ell[Y, f(X)] - \ell[Y, f^*(X)]\bigr]\bigr\}\Bigr\}\Bigr\} \le \lambda(1 + \eta)[R(f) - R(f^*)],$$

$$\text{and}\quad \log\Bigl\{\mathbb{E}\Bigl\{\exp\bigl\{-\lambda\bigl[\ell[Y, f(X)] - \ell[Y, f^*(X)]\bigr]\bigr\}\Bigr\}\Bigr\} \le -\lambda(1 - \eta)[R(f) - R(f^*)].$$

Condition C. There exist a probability distribution $\pi$, and constants $D > 0$
and $G > 0$ such that for any $0 < \alpha < \beta$,

$$\log\Biggl(\frac{\int \exp\{-\alpha[R(f) - R(f^*)]\}\, \pi(df)}{\int \exp\{-\beta[R(f) - R(f^*)]\}\, \pi(df)}\Biggr) \le D \log\Bigl(\frac{G\beta}{\alpha}\Bigr).$$

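Theorem 5.2 Assume that V1 and C are satisfied. Let $\hat\pi^{(\mathrm{Gibbs})}$ be the probability
distribution on $\mathcal{F}$ defined by its density

$$\frac{d\hat\pi^{(\mathrm{Gibbs})}}{d\pi}(f) = \frac{\exp\bigl\{-\lambda \sum_{i=1}^{n} \ell[Y_i, f(X_i)]\bigr\}}{\int \exp\bigl\{-\lambda \sum_{i=1}^{n} \ell[Y_i, f'(X_i)]\bigr\}\, \pi(df')},$$

where $\lambda > 0$ and the distribution $\pi$ are those appearing respectively in V1 and C.
Let $\hat f \in \mathcal{F}$ be a function drawn according to this Gibbs distribution. Then for any
$\eta'$ such that $0 < \eta' < 1 - \eta$ (where $\eta$ is the constant appearing in V1) and any
$\varepsilon > 0$, with probability at least $1 - \varepsilon$, we have

$$R(\hat f) - R(f^*) \le \frac{C_1' D + C_2' \log(2\varepsilon^{-1})}{n}$$

with

$$C_1' = \frac{\log\bigl(\frac{G(1 + \eta)}{\eta'}\bigr)}{\lambda(1 - \eta - \eta')} \qquad\text{and}\qquad C_2' = \frac{2}{\lambda(1 - \eta - \eta')}.$$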
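Proof. We consider $\ell[(X, Y), f, f'] = \lambda\{\ell[Y, f(X)] - \ell[Y, f'(X)]\}$, where
$\lambda$ is the constant appearing in the variance assumption. Let us take $\gamma^* = 0$ and
let $\pi^*$ be the Dirac distribution at $f^*$: $\pi^*(\{f^*\}) = 1$. Then Condition V1 implies
Condition I (page 28) and we can apply Theorem 5.1. We have

$$L(f, f') = \lambda\{\ell[Y, f(X)] - \ell[Y, f'(X)]\},$$

$$\hat{\mathcal{E}}(f) = \lambda \sum_{i=1}^{n} \ell[Y_i, f(X_i)] - \lambda \sum_{i=1}^{n} \ell[Y_i, f^*(X_i)],$$

$$\hat\pi = \hat\pi^{(\mathrm{Gibbs})},$$

$$L^\flat(f) = -n \log\bigl\{\mathbb{E}\bigl[\exp[-L(f, f^*)]\bigr]\bigr\},$$

$$\mathcal{E}^\sharp(f) = n \log\bigl\{\mathbb{E}\bigl[\exp[L(f, f^*)]\bigr]\bigr\}$$

and Assumption V1 leads to:

$$\log\bigl\{\mathbb{E}\bigl[\exp[L(f, f^*)]\bigr]\bigr\} \le \lambda(1 + \eta)[R(f) - R(f^*)]$$

$$\text{and}\quad \log\bigl\{\mathbb{E}\bigl[\exp[-L(f, f^*)]\bigr]\bigr\} \le -\lambda(1 - \eta)[R(f) - R(f^*)].$$

Thus choosing $\rho = \hat\pi$, (5.11) gives

$$[\lambda n(1 - \eta) - \gamma]\, \bar R(\hat f) \le -\mathcal{I}(\gamma) + \mathcal{I}[\lambda n(1 + \eta)] + 2\log(2\varepsilon^{-1}).$$

Accordingly by the complexity assumption, for $\gamma \le \lambda n(1 + \eta)$, we get

$$[\lambda n(1 - \eta) - \gamma]\, \bar R(\hat f) \le D \log\Bigl(\frac{G \lambda n(1 + \eta)}{\gamma}\Bigr) + 2\log(2\varepsilon^{-1}),$$

which implies the announced result. □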

Let us conclude this section by mentioning settings in which assumptions V1
and C are satisfied.

Lemma 5.3 Let $\Theta$ be a bounded convex set of $\mathbb{R}^d$, and $\varphi_1, \dots, \varphi_d$ be $d$ square
integrable prediction functions. Assume that

$$\mathcal{F} = \Bigl\{ f_\theta = \sum_{j=1}^{d} \theta_j \varphi_j;\ (\theta_1, \dots, \theta_d) \in \Theta \Bigr\},$$

$\pi$ is the uniform distribution on $\mathcal{F}$ (i.e., the one coming from the uniform distribution
on $\Theta$), and that there exist $0 < b_1 \le b_2$ such that for any $y \in \mathbb{R}$, the function
$\ell_y : y' \mapsto \ell(y, y')$ admits a second derivative satisfying: for any $y' \in \mathbb{R}$,

$$b_1 \le \ell_y''(y') \le b_2.$$

Then Condition C holds for the above uniform $\pi$, $G = \sqrt{b_2/b_1}$ and $D = d$.

Besides when $f^* = f^*_{\mathrm{lin}}$ (i.e., $\min_{\mathcal{F}} R = \min_{\theta \in \mathbb{R}^d} R(f_\theta)$), Condition C holds
for the above uniform $\pi$, $G = b_2/b_1$ and $D = d/2$.

Proof. See Section[63](page[6T]). □ 

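Remark 5.1 In particular, for the least squares loss $\ell(y, y') = (y - y')^2$, we have
$b_1 = b_2 = 2$ so that Condition C holds with $\pi$ the uniform distribution on $\mathcal{F}$, $D = d$
and $G = 1$, and with $D = d/2$ and $G = 1$ when $f^* = f^*_{\mathrm{lin}}$.

Lemma 5.4 Assume that there exist $0 < b_1 \le b_2$, $A > 0$ and $M > 0$ such that
for any $y \in \mathbb{R}$, the functions $\ell_y : y' \mapsto \ell(y, y')$ are twice differentiable and satisfy:

$$\text{for any } y' \in \mathbb{R},\quad b_1 \le \ell_y''(y') \le b_2, \tag{5.12}$$

$$\text{and for any } x \in \mathcal{X},\quad \mathbb{E}\Bigl\{\exp\bigl[A^{-1} \bigl|\ell_Y'[f^*(X)]\bigr|\bigr] \Bigm| X = x\Bigr\} \le M. \tag{5.13}$$

Assume that $\mathcal{F}$ is convex and has a diameter $H$ for $L^\infty$-norm:

$$\sup_{f_1, f_2 \in \mathcal{F},\ x \in \mathcal{X}} |f_1(x) - f_2(x)| = H.$$

In this case Condition V1 holds for any $(\lambda, \eta)$ such that

$$\eta \ge \frac{\lambda A^2}{2 b_1} \exp\bigl[M^2 \exp(H b_2 / A)\bigr]$$

and $0 < \lambda \le (2AH)^{-1}$ is small enough to ensure $\eta < 1$.
Proof. See Section 6.6 (page 62). □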



5.4. Application without exponential moment condition. When we
do not have finite exponential moments as assumed by Condition V1 (page 30),
e.g., when $\mathbb{E}\bigl\{\exp\{\lambda\{\ell[Y, f(X)] - \ell[Y, f^*(X)]\}\}\bigr\} = +\infty$ for any $\lambda > 0$ and
some function $f$ in $\mathcal{F}$, we cannot apply Theorem 5.1 with $\ell[(X, Y), f, f'] =
\lambda\{\ell[Y, f(X)] - \ell[Y, f'(X)]\}$ (because of the $\mathcal{E}^\sharp$ term). However, we can apply
it to the soft truncated excess loss

$$\ell[(X, Y), f, f'] = T\bigl(\lambda\{\ell[Y, f(X)] - \ell[Y, f'(X)]\}\bigr)$$

with $T(x) = -\log(1 - x + x^2/2)$. This section provides a result similar to Theorem
5.2 in which Condition V1 is replaced by the following condition.

Condition V2. For any function $f$, the random variable $\ell[Y, f(X)] - \ell[Y, f^*(X)]$
is square integrable and there exists $V > 0$ such that for any function $f$,

$$\mathbb{E}\Bigl\{\bigl[\ell[Y, f(X)] - \ell[Y, f^*(X)]\bigr]^2\Bigr\} \le V[R(f) - R(f^*)].$$
Theorem 5.5 Assume that Conditions V2 above and C (page 30) are satisfied.
Let $0 < \lambda < V^{-1}$ and

$$\ell[(X, Y), f, f'] = T\bigl(\lambda\{\ell[Y, f(X)] - \ell[Y, f'(X)]\}\bigr), \tag{5.14}$$

with

$$T(x) = -\log(1 - x + x^2/2). \tag{5.15}$$

Let $\hat f \in \mathcal{F}$ be a function drawn according to the distribution $\hat\pi$ defined in (5.10,
page 29) with $\hat{\mathcal{E}}$ defined in (5.4, page 28) and $\pi^* = \pi$ the distribution appearing
in Condition C. Then for any $0 < \eta' < 1 - \lambda V$ and $\varepsilon > 0$, with probability at
least $1 - \varepsilon$, we have

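$$R(\hat f) - R(f^*) \le V\, \frac{C_1' D + C_2' \log(2\varepsilon^{-1})}{n}$$

with

$$C_1' = \frac{\log\Bigl(\frac{G^2 (1 + \eta)^2}{\eta'(1 - \eta)}\Bigr)}{\eta(1 - \eta - \eta')}, \qquad C_2' = \frac{2}{\eta(1 - \eta - \eta')} \qquad\text{and}\qquad \eta = \lambda V.$$

In particular, for $\lambda = 0.32 V^{-1}$ and $\eta' = 0.18$, we get

$$R(\hat f) - R(f^*) \le V\, \frac{16.6\, D + 12.5 \log(2\sqrt{G}\, \varepsilon^{-1})}{n}.$$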

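Proof. We apply Theorem 5.1 for $\ell$ given by (5.14) and $\pi^* = \pi$. Let

$$W(f, f') = \lambda\{\ell[Y, f(X)] - \ell[Y, f'(X)]\} \quad\text{for any } f, f' \in \mathcal{F}.$$

Since $\log u \le u - 1$ for any $u > 0$, we have

$$L^\flat = -n \log \mathbb{E}(1 - W + W^2/2) \ge n(\mathbb{E}W - \mathbb{E}W^2/2).$$

Moreover, from Assumption V2,

$$\frac{\mathbb{E}W(f, f')^2}{2} \le \mathbb{E}W(f, f^*)^2 + \mathbb{E}W(f', f^*)^2 \le \lambda^2 V \bar R(f) + \lambda^2 V \bar R(f'), \tag{5.16}$$

hence, by introducing $\eta = \lambda V$,

$$L^\flat(f, f') \ge \lambda n\bigl[\bar R(f) - \bar R(f') - \lambda V \bar R(f) - \lambda V \bar R(f')\bigr] = \lambda n\bigl[(1 - \eta)\bar R(f) - (1 + \eta)\bar R(f')\bigr]. \tag{5.17}$$

Noting that

$$\exp[T(u)] = \frac{1}{1 - u + \frac{u^2}{2}} = \frac{1 + u + \frac{u^2}{2}}{\bigl(1 + \frac{u^2}{2}\bigr)^2 - u^2} = \frac{1 + u + \frac{u^2}{2}}{1 + \frac{u^4}{4}} \le 1 + u + \frac{u^2}{2},$$

we see that

$$L^\sharp = n \log\bigl\{\mathbb{E}\bigl[\exp[T(W)]\bigr]\bigr\} \le n\bigl[\mathbb{E}(W) + \mathbb{E}(W^2)/2\bigr].$$

Using (5.16) and still $\eta = \lambda V$, we get

$$L^\sharp(f, f') \le \lambda n\bigl[\bar R(f) - \bar R(f') + \eta \bar R(f) + \eta \bar R(f')\bigr] = \lambda n(1 + \eta)\bar R(f) - \lambda n(1 - \eta)\bar R(f'),$$

and

$$\mathcal{E}^\sharp(f) \le \lambda n(1 + \eta)\bar R(f) - \mathcal{I}(\lambda n(1 - \eta)). \tag{5.18}$$

Plugging (5.17) and (5.18) in (5.11) for $\rho = \hat\pi$, we obtain

$$[\lambda n(1 - \eta) - \gamma]\, \bar R(\hat f) + [\gamma^* - \lambda n(1 + \eta)] \int \bar R(f)\, \pi_{-\gamma^* \bar R}(df) \le \mathcal{I}(\gamma^*) - \mathcal{I}(\gamma) + \mathcal{I}(\lambda n(1 + \eta)) - \mathcal{I}(\lambda n(1 - \eta)) + 2\log(2\varepsilon^{-1}).$$

By the complexity assumption, choosing $\gamma^* = \lambda n(1 + \eta)$ and $\gamma < \lambda n(1 - \eta)$, we
get

$$[\lambda n(1 - \eta) - \gamma]\, \bar R(\hat f) \le D \log\Bigl(\frac{G^2 \lambda n (1 + \eta)^2}{\gamma(1 - \eta)}\Bigr) + 2\log(2\varepsilon^{-1}),$$

hence the desired result by considering $\gamma = \lambda n \eta'$ with $\eta' < 1 - \eta$. □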

Remark 5.2 The estimator seems abnormally complicated at first sight. This
remark aims at explaining why we were not able to consider a simpler estimator.

In Section 5.3, in which we consider the exponential moment condition V1,
we took $\ell[(X, Y), f, f'] = \lambda\{\ell[Y, f(X)] - \ell[Y, f'(X)]\}$ and $\pi^*$ as the Dirac
distribution at $f^*$. For these choices, one can easily check that $\hat\pi$ does not depend
on $f^*$.

In the absence of an exponential moment condition, we cannot consider the
function $\ell[(X, Y), f, f'] = \lambda\{\ell[Y, f(X)] - \ell[Y, f'(X)]\}$ but a truncated version
of it. The truncation function $T$ we use in Theorem 5.5 can be replaced by the
simpler function $u \mapsto (u \vee -M) \wedge M$ for some appropriate constant $M > 0$
but this would lead to a bound with worse constants, without really simplifying
the algorithm. The precise choice $T(x) = -\log(1 - x + x^2/2)$ comes from the
remarkable property: there exist second order polynomials $P^\flat$ and $P^\sharp$ such that
$\frac{1}{P^\flat(u)} \le \exp[T(u)] \le P^\sharp(u)$ and $P^\flat(u) P^\sharp(u) \le 1 + O(u^4)$ for $u \to 0$, which are
reasonable properties to ask in order to ensure that (5.8), and consequently (5.11),
are tight.

Besides, if we take $\ell$ as in (5.14) with $T$ a truncation function and $\pi^*$ as the
Dirac distribution at $f^*$, then $\hat\pi$ would depend on $f^*$, and is consequently not
observable. This is the reason why we do not consider $\pi^*$ as the Dirac distribution
at $f^*$, but $\pi^* = \pi$. This leads to the estimator considered in Theorems 5.5 and 4.1.
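The property of $T$ invoked in this remark is easy to verify numerically. In the sketch below (illustrative names; the natural candidates $P^\flat(u) = 1 - u + u^2/2$ and $P^\sharp(u) = 1 + u + u^2/2$ are assumed), the product $P^\flat P^\sharp$ equals $1 + u^4/4$ exactly, and the hard clipping $u \mapsto (u \vee -M) \wedge M$ is included for comparison.

```python
import math


def soft_truncation(u):
    """T(u) = -log(1 - u + u^2/2)."""
    return -math.log(1.0 - u + 0.5 * u * u)


def p_flat(u):
    """Candidate lower polynomial: exp(T(u)) = 1 / p_flat(u)."""
    return 1.0 - u + 0.5 * u * u


def p_sharp(u):
    """Candidate upper polynomial: exp(T(u)) <= p_sharp(u)."""
    return 1.0 + u + 0.5 * u * u


def hard_truncation(u, m):
    """The simpler clipping function (u v -m) ^ m mentioned in the remark."""
    return min(max(u, -m), m)


def check_sandwich(points):
    """True when exp(T(u)) <= p_sharp(u) and p_flat * p_sharp = 1 + u^4/4."""
    for u in points:
        if math.exp(soft_truncation(u)) > p_sharp(u) + 1e-12:
            return False
        if abs(p_flat(u) * p_sharp(u) - (1.0 + u ** 4 / 4.0)) > 1e-9:
            return False
    return True
```

Unlike the hard clipping, $T$ is smooth and behaves like the identity up to third order near 0, which is what keeps the constants of the bound small.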

Remark 5.3 Theorem 5.5 still holds for the same randomized estimator in which
(5.15, page 33) is replaced with

$$T(x) = \log(1 + x + x^2/2).$$

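Condition V2 holds under weak assumptions as illustrated by the following
lemma.

Lemma 5.6 Consider the least squares setting: $\ell(y, y') = (y - y')^2$. Assume that
$\mathcal{F}$ is convex and has a diameter $H$ for $L^\infty$-norm:

$$\sup_{f_1, f_2 \in \mathcal{F},\ x \in \mathcal{X}} |f_1(x) - f_2(x)| = H$$

and that for some $\sigma > 0$, we have

$$\sup_{x \in \mathcal{X}} \mathbb{E}\bigl\{[Y - f^*(X)]^2 \bigm| X = x\bigr\} \le \sigma^2 < +\infty. \tag{5.19}$$

Then Condition V2 holds for $V = (2\sigma + H)^2$.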

Proof. See Section[0(page[63l)- □ 

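Lemma 5.7 Consider the least squares setting: $\ell(y, y') = (y - y')^2$. Assume that
$\mathcal{F}$ (i.e., $\Theta$) is bounded, and that for any $j \in \{1, \dots, d\}$, we have $\mathbb{E}[\varphi_j^4(X)] < +\infty$
and $\mathbb{E}\{\varphi_j(X)^2 [Y - f^*(X)]^2\} < +\infty$. Then Condition V2 holds for

$$V = \Biggl[\, 2\sqrt{\sup_{f \in \mathcal{F}_{\mathrm{lin}}:\ \mathbb{E}[f(X)^2] = 1} \mathbb{E}\bigl(f^2(X)[Y - f^*(X)]^2\bigr)} + \sqrt{\sup_{f', f'' \in \mathcal{F}} \mathbb{E}\bigl([f'(X) - f''(X)]^2\bigr)}\ \sqrt{\sup_{f \in \mathcal{F}_{\mathrm{lin}}:\ \mathbb{E}[f(X)^2] = 1} \mathbb{E}\bigl[f^4(X)\bigr]}\, \Biggr]^2.$$
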

Proof. See Section [Q] (page 1641) . □ 



6. Proofs 

6.1. Main ideas of the proofs. The goal of this section is to explain the key
ingredients appearing in the proofs, which allow us both to obtain sub-exponential
tails for the excess risk under a non-exponential moment assumption and to get rid
of the logarithmic factor in the excess risk bound.
6.1.1. Sub-exponential tails under a non-exponential moment assumption via truncation.
Let us start with the idea allowing us to prove exponential inequalities
under just a moment assumption (instead of the traditional exponential moment
assumption). To understand it, we can consider the (apparently) simplistic
1-dimensional situation in which we have $\Theta = \mathbb{R}$ and the marginal distribution of
$\varphi_1(X)$ is the Dirac distribution at 1. In this case, the risk of the prediction function
$f_\theta$ is $R(f_\theta) = \mathbb{E}(Y - \theta)^2 = \mathbb{E}(Y - \theta^*)^2 + (\mathbb{E}Y - \theta)^2$ with $\theta^* = \mathbb{E}Y$, so that the least squares
regression problem boils down to the estimation of the mean of the output variable.
If we only assume that $Y$ admits a finite second moment, say $\mathbb{E}Y^2 \le 1$, it is not
clear whether for any $\varepsilon > 0$, it is possible to find $\hat\theta$ such that with probability at
least $1 - 2\varepsilon$,

$$R(f_{\hat\theta}) - R(f^*) = \bigl(\mathbb{E}(Y) - \hat\theta\bigr)^2 \le c\, \frac{\log(\varepsilon^{-1})}{n}, \tag{6.1}$$

for some numerical constant $c$. Indeed, from Chebyshev's inequality, the trivial
choice $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} Y_i$ just satisfies: with probability at least $1 - 2\varepsilon$,

$$R(f_{\hat\theta}) - R(f^*) \le \frac{1}{n\varepsilon},$$

which is far from the objective (6.1) for small confidence levels (consider $\varepsilon =
\exp(-\sqrt{n})$ for instance). The key idea is thus to average (soft) truncated values
of the outputs. This is performed by taking

$$\hat\theta = \frac{1}{n\lambda} \sum_{i=1}^{n} \log\Bigl(1 + \lambda Y_i + \frac{\lambda^2 Y_i^2}{2}\Bigr),$$

with $\lambda = \sqrt{\frac{2\log(\varepsilon^{-1})}{n}}$. Since we have

$$\log \mathbb{E}\exp(n\lambda\hat\theta) = n \log\Bigl(1 + \lambda\mathbb{E}(Y) + \frac{\lambda^2}{2}\mathbb{E}(Y^2)\Bigr) \le n\lambda\mathbb{E}(Y) + n\frac{\lambda^2}{2},$$

the exponential Chebyshev's inequality (see Lemma 6.10) guarantees that with
probability at least $1 - \varepsilon$, we have $n\lambda(\hat\theta - \mathbb{E}(Y)) \le n\frac{\lambda^2}{2} + \log(\varepsilon^{-1})$, hence

$$\hat\theta - \mathbb{E}(Y) \le \sqrt{\frac{2\log(\varepsilon^{-1})}{n}}.$$

Replacing $Y$ by $-Y$ in the previous argument, we obtain that with probability at
least $1 - \varepsilon$, we have

$$n\lambda\Bigl\{\mathbb{E}(Y) + \frac{1}{n\lambda}\sum_{i=1}^{n} \log\Bigl(1 - \lambda Y_i + \frac{\lambda^2 Y_i^2}{2}\Bigr)\Bigr\} \le n\frac{\lambda^2}{2} + \log(\varepsilon^{-1}).$$

Since $-\log(1 + x + x^2/2) \le \log(1 - x + x^2/2)$, this implies $\mathbb{E}(Y) - \hat\theta \le \sqrt{\frac{2\log(\varepsilon^{-1})}{n}}$.
The two previous inequalities imply Inequality (6.1) (for $c = 2$), showing that
sub-exponential tails are achievable even when we only assume that the random
variable admits a finite second moment (see [10] for more details on the robust
estimation of the mean of a random variable).
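The soft-truncated mean estimator described above fits in a few lines. This is a minimal sketch with illustrative function names; it follows the formula $\hat\theta = \frac{1}{n\lambda}\sum_i \log(1 + \lambda Y_i + \lambda^2 Y_i^2/2)$ with $\lambda = \sqrt{2\log(\varepsilon^{-1})/n}$, and the companion function returns the deviation bound $\sqrt{2\log(\varepsilon^{-1})/n}$ that holds when $\mathbb{E}Y^2 \le 1$.

```python
import math


def soft_truncated_mean(ys, eps):
    """Average of soft-truncated outputs, tuned for confidence level eps."""
    n = len(ys)
    if n == 0 or not 0.0 < eps < 1.0:
        raise ValueError("need a non-empty sample and 0 < eps < 1")
    lam = math.sqrt(2.0 * math.log(1.0 / eps) / n)
    # 1 + t + t^2/2 >= 1/2 for every real t, so the logarithm is always defined
    total = sum(math.log(1.0 + lam * y + 0.5 * (lam * y) ** 2) for y in ys)
    return total / (n * lam)


def deviation_bound(n, eps):
    """Bound on |theta_hat - E(Y)| valid on each side with probability 1 - eps."""
    return math.sqrt(2.0 * math.log(1.0 / eps) / n)
```

Because the logarithm grows slowly, a single huge observation moves this estimate far less than it moves the empirical mean, which is the mechanism behind the sub-exponential tails.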

6.1.2. Localized PAC-Bayesian inequalities to eliminate a logarithm factor.

High level description of the PAC-Bayesian approach and the localization argument. The analysis of statistical inference generally relies on upper bounding
the supremum of an empirical process $\chi$ indexed by the functions in a model $\mathcal{F}$.
Concentration inequalities are one central tool to obtain these bounds. An alternative approach, called the PAC-Bayesian one, consists in using the entropic
equality

\[ \mathbb{E}\exp\Big(\sup_{\rho\in\mathcal{M}}\Big\{\int\rho(df)\chi(f) - K(\rho,\pi')\Big\}\Big) = \int\pi'(df)\,\mathbb{E}\exp\big(\chi(f)\big), \tag{6.2} \]

where $\mathcal{M}$ is the set of probability distributions on $\mathcal{F}$ and $K(\rho,\pi')$ is the Kullback-Leibler
divergence (whose definition is recalled in (6.29)) between $\rho$ and some
fixed distribution $\pi'$.

Let $\hat r : \mathcal{F} \to \mathbb{R}$ be an observable process such that for any $f \in \mathcal{F}$, we have

\[ \mathbb{E}\exp\big(\chi(f)\big) \le 1 \]

for $\chi(f) = \lambda[R(f) - \hat r(f)]$ and some $\lambda > 0$. Then (6.2) leads to: for any $\epsilon > 0$,
with probability at least $1 - \epsilon$, for any distribution $\rho$ on $\mathcal{F}$, we have

\[ \int\rho(df)R(f) \le \int\rho(df)\hat r(f) + \frac{K(\rho,\pi') + \log(\epsilon^{-1})}{\lambda}. \tag{6.3} \]

The left-hand side quantity represents the expected risk with respect to the distribution $\rho$. To get the smallest upper bound on this quantity, a natural choice of the
(posterior) distribution $\rho$ is obtained by minimizing the right-hand side, that is by
taking $\rho = \pi'_{-\lambda\hat r}$ (with the notation introduced in (5.9)). This distribution concentrates on functions $f \in \mathcal{F}$ for which $\hat r(f)$ is small. Without prior knowledge,
one may want to choose a prior distribution $\pi' = \pi$ which is rather "flat" (e.g.,
the one induced by the Lebesgue measure in the case of a model $\mathcal{F}$ defined by
a bounded parameter set in some Euclidean space). Consequently the Kullback-Leibler
divergence $K(\rho,\pi')$, which should be seen as the complexity term, might
be excessively large.
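To make the minimization of the right-hand side of (6.3) concrete, here is a minimal Python sketch on a finite model, where distributions are plain lists of weights. The four candidate functions, their empirical risks and the values of $\lambda$ and $\epsilon$ are made-up illustrative numbers; the helper names are ours.

```python
import math

def kl(rho, pi):
    """Kullback-Leibler divergence K(rho, pi) between two finite distributions."""
    return sum(p * math.log(p / q) for p, q in zip(rho, pi) if p > 0.0)

def bound(rho, pi, risks, lam, eps):
    """Right-hand side of (6.3): average empirical risk plus complexity term."""
    average_risk = sum(p * r for p, r in zip(rho, risks))
    return average_risk + (kl(rho, pi) + math.log(1.0 / eps)) / lam

def gibbs(pi, risks, lam):
    """Posterior proportional to pi(f) * exp(-lam * r(f))."""
    weights = [q * math.exp(-lam * r) for q, r in zip(pi, risks)]
    z = sum(weights)
    return [w / z for w in weights]

pi = [0.25, 0.25, 0.25, 0.25]          # flat prior on four candidate functions
risks = [0.9, 0.5, 0.2, 0.6]           # hypothetical empirical risks
lam, eps = 10.0, 0.05

rho_star = gibbs(pi, risks, lam)
best = bound(rho_star, pi, risks, lam, eps)
# Closed form of the minimum, from the duality formula for the KL divergence.
closed_form = (-math.log(sum(q * math.exp(-lam * r) for q, r in zip(pi, risks)))
               + math.log(1.0 / eps)) / lam
```

The Gibbs distribution attains the closed-form minimum, and any other posterior (the prior itself, or a point mass on the empirical risk minimizer) gives a larger value of the bound.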
To overcome the lack of prior information and the resulting high complexity
term, one can alternatively use a more "localized" prior distribution $\pi' = \pi_{-\beta R}$
for some $\beta > 0$. Since the right-hand side of (6.3) is then no longer observable, an
empirical upper bound on $K(\rho,\pi_{-\beta R})$ is required. It is obtained by writing

\[ K(\rho,\pi_{-\beta R}) = K(\rho,\pi) + \log\Big(\int\pi(df)\exp[-\beta R(f)]\Big) + \beta\int\rho(df)R(f), \]

and by controlling the two non-observable terms by their empirical versions, calling for additional PAC-Bayesian inequalities.
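For completeness, the decomposition of the localized divergence can be checked in two lines, assuming $\rho \ll \pi$ and that $\pi_{-\beta R}$ denotes the distribution with density proportional to $\exp[-\beta R]$ with respect to $\pi$:

```latex
\begin{align*}
\frac{d\pi_{-\beta R}}{d\pi}(f)
  &= \frac{\exp[-\beta R(f)]}{\int \pi(df')\exp[-\beta R(f')]},\\
K(\rho,\pi_{-\beta R})
  &= \int\rho(df)\log\frac{d\rho}{d\pi}(f)
     - \int\rho(df)\log\frac{d\pi_{-\beta R}}{d\pi}(f)\\
  &= K(\rho,\pi) + \beta\int\rho(df)R(f)
     + \log\Big(\int\pi(df')\exp[-\beta R(f')]\Big).
\end{align*}
```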

Low level description of localization. To simplify a more detailed presentation
of the PAC-Bayesian localization argument, we will consider a setting in which
$\mathcal{F}$, $\varphi_1, \dots, \varphi_d$ and the outputs are bounded almost surely; specifically, assume

\[ \mathbb{P}\big(\text{for any } f \in \mathcal{F},\ |Y - f(X)| \le 1\big) = 1. \]

Introduce $g(u) = [\exp(u) - 1 - u]/u^2$ for any $u > 0$, $\bar R(f) = R(f) - R(f^*)$
and $\bar r(f) = r(f) - r(f^*)$ for any $f \in \mathcal{F}$. Let $\pi$ be a distribution on $\mathcal{F}$ and

\[ \Delta(f,f^*) = \mathbb{E}\big\{\big([Y - f(X)]^2 - [Y - f^*(X)]^2\big)^2\big\}. \]

The starting point is the following
PAC-Bayesian inequality: for any $\epsilon > 0$ and $\lambda > 0$, with probability at least $1 - \epsilon$,
for any distribution $\rho$ on $\mathcal{F}$, we have

\[ \int\rho(df)\bar R(f) \le \int\rho(df)\bar r(f) + \frac{\lambda}{n}\,g\Big(\frac{\lambda}{n}\Big)\int\rho(df)\Delta(f,f^*) + \frac{K(\rho,\pi) + \log(\epsilon^{-1})}{\lambda}. \tag{6.4} \]

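This inequality derives from the duality formula given in (6.30), the inequality

\[ \mathbb{E}\exp\Big(\frac{\lambda}{n}\big\{[Y - f^*(X)]^2 - [Y - f(X)]^2 + R(f) - R(f^*)\big\} - \frac{\lambda^2}{n^2}\,g\Big(\frac{\lambda}{n}\Big)\Delta(f,f^*)\Big) \le 1, \]

and Lemma 6.10 (see (Jl Theorem 8.1]). Since

\[ \Delta(f,f^*) = \mathbb{E}\big\{[f(X) - f^*(X)]^2\,[2Y - f(X) - f^*(X)]^2\big\} \le 4\,\mathbb{E}\big\{[f(X) - f^*(X)]^2\big\} \le 4\bar R(f), \]

by taking $\lambda = n/6$, Inequality (6.4) implies

\[ \int\rho(df)\bar R(f) \le 2\int\rho(df)\bar r(f) + 10\,\frac{K(\rho,\pi) + \log(\epsilon^{-1})}{n}. \tag{6.5} \]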

The distribution

\[ \pi_{-nr/5}(df) = \frac{\exp[-n r(f)/5]}{\int\exp[-n r(f')/5]\,\pi(df')}\,\pi(df) \]

minimizes the right-hand side, and we have

\[ \int\pi_{-nr/5}(df)\bar R(f) \le 10\,\frac{-\log\big(\int\pi(df)\exp[-n\bar r(f)/5]\big) + \log(\epsilon^{-1})}{n}. \]
Let $\pi_U$ be the uniform distribution on $\mathcal{F}$ (i.e., the one coming from the uniform
distribution on $\Theta$). For $\pi = \pi_U$, using similar arguments to the ones developed
in Section 6.3, it can be shown that $-\log\big(\int\pi(df)\exp[-n\bar r(f)/5]\big) \le c\,d\log(n)$
for some constant $c$ depending only on $\sup_{f,f'\in\mathcal{F}}\|f - f'\|_\infty$. This implies a $d\log(n)/n$
convergence rate of the excess risk of the randomized algorithm associated with $\pi_{-nr/5}$.
The localization idea from [JJ allows one to prove

jpamu) < 2 jpmru) + i^fSk^niML^^ (6.6) 

with

\[ \pi'(df) = \frac{\exp[-\zeta n r(f)]}{\int\exp[-\zeta n r(f')]\,\pi(df')}\,\pi(df) \]

and for some $0 < \zeta < 1/5$. The key difference with (6.5) is that the Kullback-Leibler term is now much smaller for the
distributions $\rho$ which concentrate on low empirical risk functions, like $\pi_{-nr/5}$. Since
$-\log\big(\int\pi'(df)\exp[-n\bar r(f)/5]\big) \le c\,d$ for some constant $c$ depending only on $\zeta$ (see
Lemma 5.3), this allows one to get rid of the $\log n$ factor and obtain a convergence rate
of order $d/n$.

The proof of (6.6) is rather intricate but the central idea is to use (6.5) for

\[ \pi(df) = \frac{\exp[-nR(f)/5]}{\int\exp[-nR(f')/5]\,\pi_U(df')}\,\pi_U(df), \]

and control the non-observable Kullback-Leibler term by $c\int\rho(df)\bar R(f)$ plus $K(\rho,\pi')$ up to minor additive terms.

Let us conclude this section by pointing out some difficulties and possibilities when considering unbounded $Y - f_\theta(X)$. The sketches of proof presented
hereafter are far from being actual proofs as some technical problems are hidden.
Full proofs will be given in the later sections. For unbounded $Y - f_\theta(X)$, Inequality (6.4) no longer holds, but by using the soft truncation argument of the
previous section, one can prove a similar inequality in which $\int\rho(df)\bar r(f)$ is replaced with

\[ \frac{1}{\lambda}\int\rho(df)\sum_{i=1}^n\log\Big(1 + W_i(f,f^*) + \frac{W_i^2(f,f^*)}{2}\Big) \]

for $W_i(f,f^*) = \frac{\lambda}{n}\big\{[Y_i - f(X_i)]^2 - [Y_i - f^*(X_i)]^2\big\}$, with $\lambda > 0$ a parameter of the bound. One
significant difficulty is that the minimizer of this quantity is no longer observable
(since $f^*$ is unknown). Nevertheless the quantity can be upper bounded by the
observable one:

\[ \max_{f'\in\mathcal{F}}\ \frac{1}{\lambda}\int\rho(df)\sum_{i=1}^n\log\Big(1 + W_i(f,f') + \frac{W_i^2(f,f')}{2}\Big). \]

This explains why a min-max appears in the procedures of Section 3.

Another interesting idea is to use Gaussian distributions for $\pi$ and $\rho$, which
are respectively centered at $\theta^*$ and $\theta$, and with covariance matrix proportional to
the identity matrix. The interest of these choices comes essentially from the coexistence of the two following properties: the distribution $\pi$ concentrates on a
neighbourhood of the best prediction function so the complexity term $K(\rho,\pi)$
can be much smaller than the one obtained for $\pi$ the uniform distribution on $\mathcal{F}$
(this is again the localization idea), and $K(\rho,\pi)$ and, when $\Theta = \mathbb{R}^d$, the integrals
with respect to $\rho$ can be explicitly computed in terms of $R(\theta)$ and other rather
simple quantities, which implies that the modified inequality (6.4) gets a tractable
form for further computations, provided nevertheless some assumptions on the
eigenvalues of the matrix $Q$. The idea of using PAC-Bayesian inequalities with
Gaussian prior and posterior distributions was first proposed by Langford and
Shawe-Taylor [14] in the context of linear classification.

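6.2. Proofs of Theorems 2.1 and 2.2. To shorten the formulae, we will
write $X$ for $\varphi(X)$, which is equivalent to considering without loss of generality
that the input space is $\mathbb{R}^d$ and that the functions $\varphi_1, \dots, \varphi_d$ are the coordinate
functions. Therefore, the function $f_\theta$ maps an input $x$ to $\langle\theta, x\rangle$. With a slight
abuse of notation, $R(\theta)$ will denote the risk of this prediction function.

Let us first assume that the matrix $Q_\lambda = Q + \lambda I$ is positive definite. This
indeed does not restrict the generality of our study, even in the case when $\lambda = 0$,
as we will discuss later (Remark 6.1). Consider the change of coordinates

\[ \bar X = Q_\lambda^{-1/2}X. \]

Let us introduce

\[ \bar R(\theta) = \mathbb{E}\big[(\langle\theta,\bar X\rangle - Y)^2\big], \]

so that

\[ \bar R(Q_\lambda^{1/2}\theta) = R(\theta) = \mathbb{E}\big[(\langle\theta, X\rangle - Y)^2\big]. \]

Let

\[ \bar\Theta = \{Q_\lambda^{1/2}\theta;\ \theta\in\Theta\}. \]

Consider

\begin{align}
r(\theta) &= \frac{1}{n}\sum_{i=1}^n\big(\langle\theta, X_i\rangle - Y_i\big)^2, \tag{6.7}\\
\bar r(\theta) &= \frac{1}{n}\sum_{i=1}^n\big(\langle\theta, \bar X_i\rangle - Y_i\big)^2, \tag{6.8}\\
\theta_0 &= \arg\min_{\theta\in\bar\Theta}\ \bar R(\theta) + \lambda\|Q_\lambda^{-1/2}\theta\|^2, \tag{6.9}\\
\hat\theta &\in \arg\min_{\theta\in\Theta}\ r(\theta) + \lambda\|\theta\|^2, \tag{6.10}\\
\theta_1 = Q_\lambda^{1/2}\hat\theta &\in \arg\min_{\theta\in\bar\Theta}\ \bar r(\theta) + \lambda\|Q_\lambda^{-1/2}\theta\|^2. \tag{6.11}
\end{align}

For $\alpha > 0$, let us introduce the notation

\begin{align*}
W_i(\theta) &= \alpha\big\{(\langle\theta,\bar X_i\rangle - Y_i)^2 - (\langle\theta_0,\bar X_i\rangle - Y_i)^2\big\},\\
W(\theta) &= \alpha\big\{(\langle\theta,\bar X\rangle - Y)^2 - (\langle\theta_0,\bar X\rangle - Y)^2\big\}.
\end{align*}

For any $\theta_2 \in \mathbb{R}^d$ and $\beta > 0$, let us consider the Gaussian distribution centered
at $\theta_2$

\[ \rho_{\theta_2}(d\theta) = \Big(\frac{\beta}{2\pi}\Big)^{d/2}\exp\Big(-\frac{\beta}{2}\|\theta - \theta_2\|^2\Big)\,d\theta. \]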



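Lemma 6.1 For any $\eta > 0$ and $\alpha > 0$, with probability at least $1 - \exp(-\eta)$, for
any $\theta_2 \in \mathbb{R}^d$,

\begin{multline*}
-n\int\rho_{\theta_2}(d\theta)\log\Big\{1 - \mathbb{E}[W(\theta)] + \mathbb{E}[W(\theta)^2]/2\Big\}\\
\le -\sum_{i=1}^n\int\rho_{\theta_2}(d\theta)\log\Big\{1 - W_i(\theta) + W_i(\theta)^2/2\Big\} + K(\rho_{\theta_2},\rho_{\theta_0}) + \eta,
\end{multline*}

where $K(\rho_{\theta_2},\rho_{\theta_0})$ is the Kullback-Leibler divergence function:

\[ K(\rho_{\theta_2},\rho_{\theta_0}) = \int\rho_{\theta_2}(d\theta)\log\Big[\frac{d\rho_{\theta_2}}{d\rho_{\theta_0}}(\theta)\Big]. \]

Proof. We have

\[ \mathbb{E}\int\rho_{\theta_0}(d\theta)\prod_{i=1}^n\frac{1 - W_i(\theta) + W_i(\theta)^2/2}{1 - \mathbb{E}[W(\theta)] + \mathbb{E}[W(\theta)^2]/2} \le 1, \]

thus with probability at least $1 - \exp(-\eta)$

\[ \log\int\rho_{\theta_0}(d\theta)\prod_{i=1}^n\frac{1 - W_i(\theta) + W_i(\theta)^2/2}{1 - \mathbb{E}[W(\theta)] + \mathbb{E}[W(\theta)^2]/2} \le \eta. \]

We conclude from the convex inequality (see [8, page 159])

\[ \log\Big(\int\rho_{\theta_0}(d\theta)\exp[h(\theta)]\Big) \ge \int\rho_{\theta_2}(d\theta)h(\theta) - K(\rho_{\theta_2},\rho_{\theta_0}). \]

□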



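Let us compute some useful quantities

\begin{align}
K(\rho_{\theta_2},\rho_{\theta_0}) &= \frac{\beta}{2}\|\theta_2 - \theta_0\|^2, \tag{6.12}\\
\int\rho_{\theta_2}(d\theta)[W(\theta)] &= \alpha\int\rho_{\theta_2}(d\theta)\langle\theta - \theta_2,\bar X\rangle^2 + W(\theta_2) \tag{6.13}\\
&= W(\theta_2) + \frac{\alpha\|\bar X\|^2}{\beta}, \tag{6.14}
\end{align}

\begin{align}
\int\rho_{\theta_2}(d\theta)[W(\theta)^2] &= \alpha^2\int\rho_{\theta_2}(d\theta)\langle\theta - \theta_0,\bar X\rangle^2\big(\langle\theta + \theta_0,\bar X\rangle - 2Y\big)^2 \notag\\
&= \alpha^2\int\rho_{\theta_2}(d\theta)\langle\theta - \theta_2 + \theta_2 - \theta_0,\bar X\rangle^2\big(\langle\theta - \theta_2 + \theta_2 + \theta_0,\bar X\rangle - 2Y\big)^2 \notag\\
&= \int\rho_{\theta_2}(d\theta)\Big[\alpha\langle\theta - \theta_2,\bar X\rangle^2 + 2\alpha\langle\theta - \theta_2,\bar X\rangle\big(\langle\theta_2,\bar X\rangle - Y\big) + W(\theta_2)\Big]^2 \notag\\
&= \int\rho_{\theta_2}(d\theta)\Big[\alpha^2\langle\theta - \theta_2,\bar X\rangle^4 + 4\alpha^2\langle\theta - \theta_2,\bar X\rangle^2\big(\langle\theta_2,\bar X\rangle - Y\big)^2 + W(\theta_2)^2 \notag\\
&\qquad\qquad + 2\alpha\langle\theta - \theta_2,\bar X\rangle^2W(\theta_2)\Big] \notag\\
&= \frac{3\alpha^2\|\bar X\|^4}{\beta^2} + \frac{2\alpha\|\bar X\|^2}{\beta}\Big[2\alpha\big(\langle\theta_2,\bar X\rangle - Y\big)^2 + W(\theta_2)\Big] + W(\theta_2)^2. \tag{6.15}
\end{align}

Using the fact that

\[ 2\alpha\big(\langle\theta_2,\bar X\rangle - Y\big)^2 + W(\theta_2) = 2\alpha\big(\langle\theta_0,\bar X\rangle - Y\big)^2 + 3W(\theta_2), \]

and that for any real numbers $a$ and $b$, $6ab \le 9a^2 + b^2$, we get

Lemma 6.2

\begin{align}
\int\rho_{\theta_2}(d\theta)[W(\theta)] &= W(\theta_2) + \frac{\alpha\|\bar X\|^2}{\beta}, \tag{6.16}\\
\int\rho_{\theta_2}(d\theta)[W(\theta)^2] &= W(\theta_2)^2 + \frac{2\alpha\|\bar X\|^2}{\beta}\Big[2\alpha\big(\langle\theta_0,\bar X\rangle - Y\big)^2 + 3W(\theta_2)\Big] + \frac{3\alpha^2\|\bar X\|^4}{\beta^2} \tag{6.17}\\
&\le 10\,W(\theta_2)^2 + \frac{4\alpha^2\|\bar X\|^2}{\beta}\big(\langle\theta_0,\bar X\rangle - Y\big)^2 + \frac{4\alpha^2\|\bar X\|^4}{\beta^2}, \tag{6.18}
\end{align}

and the same holds true when $W$ is replaced with $W_i$ and $(\bar X, Y)$ with $(\bar X_i, Y_i)$.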
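The Gaussian moment identities (6.16) and (6.17) and the bound (6.18) can be sanity-checked numerically in dimension $d = 1$. The sketch below uses a plain Riemann sum over a wide grid rather than any special quadrature routine; all numerical values (`alpha`, `beta`, `x`, `y`, `theta0`, `theta2`) are arbitrary illustrative choices, and the helper `gaussian_average` is ours.

```python
import math

def gaussian_average(func, mean, beta, half_width=12.0, steps=40001):
    """Riemann-sum approximation of the average of func under N(mean, 1/beta)."""
    sigma = 1.0 / math.sqrt(beta)
    lo = mean - half_width * sigma
    h = 2.0 * half_width * sigma / (steps - 1)
    norm = math.sqrt(beta / (2.0 * math.pi))
    total = 0.0
    for k in range(steps):
        t = lo + k * h
        total += func(t) * norm * math.exp(-0.5 * beta * (t - mean) ** 2) * h
    return total

alpha, beta = 0.3, 4.0
x, y = 1.5, 0.7              # one fixed (transformed) input-output pair
theta0, theta2 = 0.2, 0.9

def W(theta):
    return alpha * ((theta * x - y) ** 2 - (theta0 * x - y) ** 2)

mean_W = gaussian_average(W, theta2, beta)
mean_W2 = gaussian_average(lambda t: W(t) ** 2, theta2, beta)

s = alpha * x * x / beta     # alpha * ||x||^2 / beta
closed_mean = W(theta2) + s                                        # (6.16)
closed_second = (W(theta2) ** 2
                 + 2.0 * s * (2.0 * alpha * (theta0 * x - y) ** 2 + 3.0 * W(theta2))
                 + 3.0 * s * s)                                    # (6.17)
upper = (10.0 * W(theta2) ** 2
         + 4.0 * alpha * s * (theta0 * x - y) ** 2
         + 4.0 * s * s)                                            # (6.18)
```

The numerical averages agree with the closed forms, and the closed form of the second moment stays below the bound (6.18), as the inequality $6ab \le 9a^2 + b^2$ guarantees.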
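Another important thing to realize is that

\begin{align}
\mathbb{E}\big[\|\bar X\|^2\big] &= \mathbb{E}\big[\mathrm{Tr}(\bar X\bar X^T)\big] = \mathbb{E}\big[\mathrm{Tr}(Q_\lambda^{-1/2}XX^TQ_\lambda^{-1/2})\big] \notag\\
&= \mathbb{E}\big[\mathrm{Tr}(Q_\lambda^{-1}XX^T)\big] = \mathrm{Tr}\big[Q_\lambda^{-1}\mathbb{E}(XX^T)\big] \notag\\
&= \mathrm{Tr}\big(Q_\lambda^{-1}(Q_\lambda - \lambda I)\big) = d - \lambda\,\mathrm{Tr}(Q_\lambda^{-1}) = D. \tag{6.19}
\end{align}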
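Since $Q$ and $Q_\lambda$ share the same eigenvectors, (6.19) can also be written in terms of the eigenvalues $q_1, \dots, q_d$ of $Q$ as $D = \sum_j q_j/(q_j + \lambda)$. This eigenvalue form is an elementary consequence of (6.19), not a formula quoted from the text. The short Python sketch below compares the two expressions; the function names and the example spectrum are illustrative.

```python
def effective_dimension(eigenvalues, lam):
    """D = sum_j q_j / (q_j + lam), for eigenvalues q_j of Q with q_j + lam > 0."""
    return sum(q / (q + lam) for q in eigenvalues)

def effective_dimension_trace_form(eigenvalues, lam):
    """Same quantity written as d - lam * Tr(Q_lam^{-1}), as in (6.19)."""
    d = len(eigenvalues)
    return d - lam * sum(1.0 / (q + lam) for q in eigenvalues)

spectrum = [4.0, 1.0, 0.25]   # hypothetical eigenvalues of Q
D_ridge = effective_dimension(spectrum, lam=1.0)
D_ols = effective_dimension(spectrum, lam=0.0)
```

For $\lambda = 0$ and a positive definite $Q$ one recovers $D = d$; increasing $\lambda$ shrinks $D$, which is why ridge-type bounds involve $D$ rather than $d$.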
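We can weaken Lemma 6.1 noticing that for any real number $x$,
$x \le -\log(1 - x)$ and

\[ -\log\Big(1 - x + \frac{x^2}{2}\Big) = \log\Big(\frac{1 + x + x^2/2}{1 + x^4/4}\Big) \le \log\Big(1 + x + \frac{x^2}{2}\Big) \le x + \frac{x^2}{2}. \]

We obtain with probability at least $1 - \exp(-\eta)$

\begin{multline*}
n\,\mathbb{E}[W(\theta_2)] + \frac{n\alpha}{\beta}\mathbb{E}\big[\|\bar X\|^2\big] - 5n\,\mathbb{E}\big[W(\theta_2)^2\big]
- \mathbb{E}\Big\{\frac{2n\alpha^2\|\bar X\|^2}{\beta}\big(\langle\theta_0,\bar X\rangle - Y\big)^2 + \frac{2n\alpha^2\|\bar X\|^4}{\beta^2}\Big\}\\
\le \sum_{i=1}^n\Big\{W_i(\theta_2) + 5W_i(\theta_2)^2 + \frac{\alpha\|\bar X_i\|^2}{\beta} + \frac{2\alpha^2\|\bar X_i\|^2}{\beta}\big(\langle\theta_0,\bar X_i\rangle - Y_i\big)^2 + \frac{2\alpha^2\|\bar X_i\|^4}{\beta^2}\Big\}\\
+ \frac{\beta}{2}\|\theta_2 - \theta_0\|^2 + \eta.
\end{multline*}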

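Noticing that for any real numbers $a$ and $b$, $4ab \le a^2 + 4b^2$, we can then bound

\begin{align*}
\alpha^{-2}W(\theta_2)^2 &= \langle\theta_2 - \theta_0,\bar X\rangle^2\big(\langle\theta_2 + \theta_0,\bar X\rangle - 2Y\big)^2\\
&= \langle\theta_2 - \theta_0,\bar X\rangle^2\Big[\langle\theta_2 - \theta_0,\bar X\rangle + 2\big(\langle\theta_0,\bar X\rangle - Y\big)\Big]^2\\
&= \langle\theta_2 - \theta_0,\bar X\rangle^4 + 4\langle\theta_2 - \theta_0,\bar X\rangle^3\big(\langle\theta_0,\bar X\rangle - Y\big)
 + 4\langle\theta_2 - \theta_0,\bar X\rangle^2\big(\langle\theta_0,\bar X\rangle - Y\big)^2\\
&\le 2\langle\theta_2 - \theta_0,\bar X\rangle^4 + 8\langle\theta_2 - \theta_0,\bar X\rangle^2\big(\langle\theta_0,\bar X\rangle - Y\big)^2.
\end{align*}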

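Theorem 6.3 Let us put

\begin{align*}
\hat D &= \frac{1}{n}\sum_{i=1}^n\|\bar X_i\|^2 \quad\text{(let us remind that } D = \mathbb{E}[\|\bar X\|^2] \text{ from (6.19))},\\
B_1 &= 2\,\mathbb{E}\big[\|\bar X\|^2\big(\langle\theta_0,\bar X\rangle - Y\big)^2\big],\\
\hat B_1 &= \frac{2}{n}\sum_{i=1}^n\|\bar X_i\|^2\big(\langle\theta_0,\bar X_i\rangle - Y_i\big)^2,\\
B_2 &= 2\,\mathbb{E}\big[\|\bar X\|^4\big],\\
\hat B_2 &= \frac{2}{n}\sum_{i=1}^n\|\bar X_i\|^4,\\
B_3 &= 40\sup\Big\{\mathbb{E}\big[\langle u,\bar X\rangle^2\big(\langle\theta_0,\bar X\rangle - Y\big)^2\big] : u\in\mathbb{R}^d,\ \|u\| = 1\Big\},\\
\hat B_3 &= \sup\Big\{\frac{40}{n}\sum_{i=1}^n\langle u,\bar X_i\rangle^2\big(\langle\theta_0,\bar X_i\rangle - Y_i\big)^2 : u\in\mathbb{R}^d,\ \|u\| = 1\Big\},\\
B_4 &= 10\sup\Big\{\mathbb{E}\big[\langle u,\bar X\rangle^4\big] : u\in\mathbb{R}^d,\ \|u\| = 1\Big\},\\
\hat B_4 &= \sup\Big\{\frac{10}{n}\sum_{i=1}^n\langle u,\bar X_i\rangle^4 : u\in\mathbb{R}^d,\ \|u\| = 1\Big\}.
\end{align*}

With probability at least $1 - \exp(-\eta)$, for any $\theta_2 \in \mathbb{R}^d$,

\begin{multline*}
n\,\mathbb{E}[W(\theta_2)] - \Big(n\alpha^2(B_3 + \hat B_3) + \frac{\beta}{2}\Big)\|\theta_2 - \theta_0\|^2 - n\alpha^2(B_4 + \hat B_4)\|\theta_2 - \theta_0\|^4\\
\le \sum_{i=1}^n W_i(\theta_2) + \frac{n\alpha}{\beta}(\hat D - D) + \frac{n\alpha^2}{\beta}(B_1 + \hat B_1) + \frac{n\alpha^2}{\beta^2}(B_2 + \hat B_2) + \eta.
\end{multline*}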



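Let us now assume that $\theta_2 \in \bar\Theta$ and let us use the fact that $\bar\Theta$ is a convex set and
that $\theta_0 = \arg\min_{\theta\in\bar\Theta}\bar R(\theta) + \lambda\|Q_\lambda^{-1/2}\theta\|^2$. Introduce $\theta_* = \arg\min_{\theta\in\mathbb{R}^d}\bar R(\theta) + \lambda\|Q_\lambda^{-1/2}\theta\|^2$. As we have

\[ \bar R(\theta) + \lambda\|Q_\lambda^{-1/2}\theta\|^2 = \|\theta - \theta_*\|^2 + \bar R(\theta_*) + \lambda\|Q_\lambda^{-1/2}\theta_*\|^2, \]

the vector $\theta_0$ is uniquely defined as the projection of $\theta_*$ on $\bar\Theta$ for the Euclidean
distance, and for any $\theta_2 \in \bar\Theta$,

\begin{align}
\alpha^{-1}\mathbb{E}[W(\theta_2)] &+ \lambda\|Q_\lambda^{-1/2}\theta_2\|^2 - \lambda\|Q_\lambda^{-1/2}\theta_0\|^2 \notag\\
&= \bar R(\theta_2) - \bar R(\theta_0) + \lambda\|Q_\lambda^{-1/2}\theta_2\|^2 - \lambda\|Q_\lambda^{-1/2}\theta_0\|^2 \notag\\
&= \|\theta_2 - \theta_*\|^2 - \|\theta_0 - \theta_*\|^2 \notag\\
&= \|\theta_2 - \theta_0\|^2 + 2\langle\theta_2 - \theta_0,\theta_0 - \theta_*\rangle \ge \|\theta_2 - \theta_0\|^2. \tag{6.20}
\end{align}

This and the inequality

\[ \lambda\big(\|Q_\lambda^{-1/2}\theta_1\|^2 - \|Q_\lambda^{-1/2}\theta_0\|^2\big) + \frac{1}{n\alpha}\sum_{i=1}^n W_i(\theta_1) \le 0, \]

which holds because $\theta_1$ minimizes the penalized empirical risk in (6.11), lead to the following result.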
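Theorem 6.4 With probability at least $1 - \exp(-\eta)$,

\[ R(\hat\theta) + \lambda\|\hat\theta\|^2 - \inf_{\theta\in\Theta}\big[R(\theta) + \lambda\|\theta\|^2\big] \]

is not greater than the smallest positive non degenerate root of the following polynomial equation as soon as it has one

\[ \Big\{1 - \alpha(B_3 + \hat B_3) - \frac{\beta}{2n\alpha}\Big\}x - \alpha(B_4 + \hat B_4)x^2
= \frac{1}{\beta}\max(\hat D - D, 0) + \frac{\alpha}{\beta}(B_1 + \hat B_1) + \frac{\alpha}{\beta^2}(B_2 + \hat B_2) + \frac{\eta}{n\alpha}. \]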

Proof. Let us remark first that when the polynomial appearing in the theorem
has two distinct roots, they are of the same sign, due to the sign of its constant
coefficient. Let $\Omega$ be the event of probability at least $1 - \exp(-\eta)$ described in
Theorem 6.3. For any realization of this event for which the polynomial
described in Theorem 6.4 does not have two distinct positive roots, the statement
of Theorem 6.4 is void, and therefore fulfilled. Let us consider now the case when
the polynomial in question has two distinct positive roots $x_1 < x_2$. Consider in
this case the random (trivially nonempty) closed convex set

\[ \hat\Theta = \Big\{\theta\in\Theta : R(\theta) + \lambda\|\theta\|^2 \le \inf_{\theta'\in\Theta}\big[R(\theta') + \lambda\|\theta'\|^2\big] + \frac{x_1 + x_2}{2}\Big\}. \]

Let $\theta_3 \in \arg\min_{\theta\in\hat\Theta} r(\theta) + \lambda\|\theta\|^2$ and $\theta_4 \in \arg\min_{\theta\in\Theta} r(\theta) + \lambda\|\theta\|^2$. We see
from Theorem 6.3 that

\[ R(\theta_3) + \lambda\|\theta_3\|^2 < R(\theta_0) + \lambda\|\theta_0\|^2 + \frac{x_1 + x_2}{2}, \tag{6.21} \]

because it cannot be larger from the construction of $\hat\Theta$. On the other hand, since
$\hat\Theta \subset \Theta$, the line segment $[\theta_3,\theta_4]$ is such that $[\theta_3,\theta_4]\cap\hat\Theta \subset \arg\min_{\theta\in\hat\Theta} r(\theta) + \lambda\|\theta\|^2$. We can therefore apply equation (6.21) to any point of $[\theta_3,\theta_4]\cap\hat\Theta$, which
proves that $[\theta_3,\theta_4]\cap\hat\Theta$ is an open subset of $[\theta_3,\theta_4]$. But it is also a closed subset by
construction, and therefore, as it is non empty and $[\theta_3,\theta_4]$ is connected, it proves
that $[\theta_3,\theta_4]\cap\hat\Theta = [\theta_3,\theta_4]$, and thus that $\theta_4 \in \hat\Theta$. This can be applied to any choice
of $\theta_3 \in \arg\min_{\theta\in\hat\Theta} r(\theta) + \lambda\|\theta\|^2$ and $\theta_4 \in \arg\min_{\theta\in\Theta} r(\theta) + \lambda\|\theta\|^2$, proving
that $\arg\min_{\theta\in\Theta} r(\theta) + \lambda\|\theta\|^2 \subset \arg\min_{\theta\in\hat\Theta} r(\theta) + \lambda\|\theta\|^2$ and therefore that any
$\theta_4 \in \arg\min_{\theta\in\Theta} r(\theta) + \lambda\|\theta\|^2$ is such that

\[ R(\theta_4) + \lambda\|\theta_4\|^2 \le \inf_{\theta\in\Theta}\big[R(\theta) + \lambda\|\theta\|^2\big] + x_1, \]

because the values between $x_1$ and $x_2$ are excluded by Theorem 6.3. □

The actual convergence speed of the least squares estimator $\hat\theta$ on $\Theta$ will depend
on the speed of convergence of the "empirical bounds" $\hat B_k$ towards their expectations. We can rephrase the previous theorem in the following more practical way:



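Theorem 6.5 Let $\eta_0, \eta_1, \dots, \eta_5$ be positive real numbers. With probability at
least

\[ 1 - \mathbb{P}(\hat D > D + \eta_0) - \sum_{k=1}^4\mathbb{P}(\hat B_k - B_k > \eta_k) - \exp(-\eta_5), \]

$R(\hat\theta) + \lambda\|\hat\theta\|^2 - \inf_{\theta\in\Theta}\big[R(\theta) + \lambda\|\theta\|^2\big]$ is smaller than the smallest non degenerate
positive root of

\[ \Big\{1 - \alpha(2B_3 + \eta_3) - \frac{\beta}{2n\alpha}\Big\}x - \alpha(2B_4 + \eta_4)x^2
= \frac{\eta_0}{\beta} + \frac{\alpha}{\beta}(2B_1 + \eta_1) + \frac{\alpha}{\beta^2}(2B_2 + \eta_2) + \frac{\eta_5}{n\alpha}, \tag{6.22} \]

where we can optimize the values of $\alpha > 0$ and $\beta > 0$, since this equation has
non random coefficients. For example, taking for simplicity

\[ \alpha = \frac{1}{8B_3 + 4\eta_3}, \qquad \beta = \frac{n\alpha}{2}, \]

we obtain

\[ x - \frac{2B_4 + \eta_4}{4B_3 + 2\eta_3}\,x^2 = \frac{16\eta_0(2B_3 + \eta_3)}{n} + \frac{8B_1 + 4\eta_1}{n}
+ \frac{32(2B_3 + \eta_3)(2B_2 + \eta_2)}{n^2} + \frac{8\eta_5(2B_3 + \eta_3)}{n}. \]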

6.2.1. Proof of Theorem 2.1. Let us now deduce Theorem 2.1 from
Theorem 6.5. Let us first remark that with probability at least $1 - \epsilon/2$

\[ \hat D \le D + \sqrt{\frac{B_2}{\epsilon n}}, \]

because the variance of $\hat D$ is less than $B_2/(2n)$. For a given $\epsilon > 0$, let us take $\eta_0 = \sqrt{B_2/(\epsilon n)}$,
$\eta_1 = B_1$, $\eta_2 = B_2$, $\eta_3 = B_3$, $\eta_4 = B_4$ and $\eta_5 = \log(3/\epsilon)$. We get that $R_\lambda(\hat\theta) - \inf_{\theta\in\Theta}R_\lambda(\theta)$ is
smaller than the smallest positive non degenerate root of

\[ x - \frac{B_4}{2B_3}\,x^2 = \frac{48B_3}{n}\sqrt{\frac{B_2}{n\epsilon}} + \frac{12B_1}{n} + \frac{288B_2B_3}{n^2} + \frac{24\log(3/\epsilon)B_3}{n}, \]

with probability at least

\[ 1 - \frac{5\epsilon}{6} - \sum_{k=1}^4\mathbb{P}(\hat B_k > B_k + \eta_k). \]

k=l 

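According to the weak law of large numbers, there is $n_\epsilon$ such that for any $n \ge n_\epsilon$,

\[ \sum_{k=1}^4\mathbb{P}(\hat B_k > B_k + \eta_k) \le \epsilon/6. \]

Thus, increasing $n_\epsilon$ and the constants to absorb the second order terms, we see
that for some $n_\epsilon$ and any $n \ge n_\epsilon$, with probability at least $1 - \epsilon$, the excess risk is
less than the smallest positive root of

\[ x - \frac{B_4}{2B_3}\,x^2 = \frac{13B_1}{n} + \frac{24\log(3/\epsilon)B_3}{n}. \]

Now, as soon as $ac < 1/4$, the smallest positive root of $x - ax^2 = c$ is $\frac{2c}{1 + \sqrt{1 - 4ac}}$.
This means that for $n$ large enough, with probability at least $1 - \epsilon$,

\[ R_\lambda(\hat\theta) - \inf_{\theta\in\Theta}R_\lambda(\theta) \le \frac{15B_1}{n} + \frac{25\log(3/\epsilon)B_3}{n}, \]

which is precisely the statement of Theorem 2.1, up to some change of
notation.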
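The elementary root formula used in this step is easy to check numerically. The following Python sketch (the function name is ours) computes the smallest positive root of $x - ax^2 = c$ in the numerically stable form $2c/(1 + \sqrt{1 - 4ac})$, assuming $a > 0$, $c > 0$ and $4ac < 1$.

```python
import math

def smallest_positive_root(a, c):
    """Smallest positive root of x - a*x**2 = c, for a > 0, c > 0 and 4*a*c < 1."""
    if not (a > 0.0 and c > 0.0 and 4.0 * a * c < 1.0):
        raise ValueError("requires a > 0, c > 0 and 4*a*c < 1")
    # Equivalent to (1 - sqrt(1 - 4ac)) / (2a), but avoids cancellation for small a*c.
    return 2.0 * c / (1.0 + math.sqrt(1.0 - 4.0 * a * c))

root = smallest_positive_root(1.0, 0.1875)   # the roots of x - x^2 = 3/16 are 1/4 and 3/4
```

Since $0 < \sqrt{1 - 4ac} < 1$, the root always lies between $c$ and $2c$, which is why the quadratic term only affects the constants of the final bound.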

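6.2.2. Proof of Theorem 2.2. Let us now weaken Theorem 6.4 in order to make
a more explicit non asymptotic result and obtain Theorem 2.2. From now on, we
will assume that $\lambda = 0$. We start by giving bounds on the quantities defined in
Theorem 6.3 in terms of

\[ B = \sup_{f\in\mathrm{span}\{\varphi_1,\dots,\varphi_d\}-\{0\}}\frac{\|f\|_\infty^2}{\mathbb{E}[f(X)^2]}. \]

Since we have

\[ \|\bar X\|^2 = \|Q^{-1/2}X\|^2 \le dB, \]

we get

\begin{align*}
\hat D &= \frac{1}{n}\sum_{i=1}^n\|\bar X_i\|^2 \le dB,\\
B_1 &= 2\,\mathbb{E}\big[\|\bar X\|^2\big(\langle\theta_0,\bar X\rangle - Y\big)^2\big] \le 2dB\,R(f^*),\\
\hat B_1 &= \frac{2}{n}\sum_{i=1}^n\|\bar X_i\|^2\big(\langle\theta_0,\bar X_i\rangle - Y_i\big)^2 \le 2dB\,r(f^*),\\
B_2 &= 2\,\mathbb{E}\big[\|\bar X\|^4\big] \le 2d^2B^2,\\
\hat B_2 &= \frac{2}{n}\sum_{i=1}^n\|\bar X_i\|^4 \le 2d^2B^2,\\
B_3 &= 40\sup\Big\{\mathbb{E}\big[\langle u,\bar X\rangle^2\big(\langle\theta_0,\bar X\rangle - Y\big)^2\big] : u\in\mathbb{R}^d,\ \|u\| = 1\Big\} \le 40B\,R(f^*),\\
\hat B_3 &= \sup\Big\{\frac{40}{n}\sum_{i=1}^n\langle u,\bar X_i\rangle^2\big(\langle\theta_0,\bar X_i\rangle - Y_i\big)^2 : u\in\mathbb{R}^d,\ \|u\| = 1\Big\} \le 40B\,r(f^*),\\
B_4 &= 10\sup\Big\{\mathbb{E}\big[\langle u,\bar X\rangle^4\big] : u\in\mathbb{R}^d,\ \|u\| = 1\Big\} \le 10B^2,\\
\hat B_4 &= \sup\Big\{\frac{10}{n}\sum_{i=1}^n\langle u,\bar X_i\rangle^4 : u\in\mathbb{R}^d,\ \|u\| = 1\Big\} \le 10B^2.
\end{align*}

Let us put

\begin{align*}
a_0 &= \frac{2dB + 4dB\alpha[R(f^*) + r(f^*)] + \eta}{\alpha n} + \frac{16B^2d^2}{\alpha n^2},\\
a_1 &= 3/4 - 40\alpha B[R(f^*) + r(f^*)],
\end{align*}

and

\[ a_2 = 20\alpha B^2. \]

Theorem 6.4 applied with $\beta = n\alpha/2$ implies that with probability at least $1 - \exp(-\eta)$
the excess risk $R(\hat f^{(\mathrm{erm})}) - R(f^*)$ is upper bounded by the smallest positive root
of $a_1x - a_2x^2 = a_0$ as soon as $a_1^2 > 4a_0a_2$. In particular, setting $\epsilon = \exp(-\eta)$,
when (6.23) holds, we have

\[ R(\hat f^{(\mathrm{erm})}) - R(f^*) \le \frac{2a_0}{a_1 + \sqrt{a_1^2 - 4a_0a_2}} \le \frac{2a_0}{a_1}. \]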

We conclude that 



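Theorem 6.6 For any $\alpha > 0$ and $\epsilon > 0$, with probability at least $1 - \epsilon$, if the
inequality

\[ 80\left(\frac{(2 + 4\alpha[R(f^*) + r(f^*)])Bd + \log(\epsilon^{-1})}{n} + \Big(\frac{4Bd}{n}\Big)^2\right) < \left(\frac{3}{4B} - 40\alpha[R(f^*) + r(f^*)]\right)^2 \tag{6.23} \]

holds, then we have

\[ R(\hat f^{(\mathrm{erm})}) - R(f^*) \le \mathcal{J}\left(\frac{(2 + 4\alpha[R(f^*) + r(f^*)])Bd + \log(\epsilon^{-1})}{n} + \Big(\frac{4Bd}{n}\Big)^2\right), \tag{6.24} \]

where $\mathcal{J} = 8/\big(3\alpha - 160\alpha^2B[R(f^*) + r(f^*)]\big)$.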

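Now, the Bienaymé-Chebyshev inequality implies

\[ \mathbb{P}\big(r(f^*) - R(f^*) \ge t\big) \le \frac{\mathbb{E}\big(r(f^*) - R(f^*)\big)^2}{t^2} \le \frac{\mathbb{E}[Y - f^*(X)]^4}{nt^2}. \]

Under the finite moment assumption of Theorem 2.2, we obtain that for any $\epsilon \ge 1/n$,
with probability at least $1 - \epsilon$,

\[ r(f^*) < R(f^*) + \sqrt{\mathbb{E}[Y - f^*(X)]^4}. \]

From Theorem 6.6 and a union bound, by taking

\[ \alpha = \Big(80B\big[2R(f^*) + \sqrt{\mathbb{E}[Y - f^*(X)]^4}\big]\Big)^{-1}, \]

we get that with probability $1 - 2\epsilon$,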



\ n \ n 

with $\mathcal{J}_1 = 640\big(2R(f^*) + \sqrt{\mathbb{E}\{[Y - f^*(X)]^4\}}\big)$. This concludes the proof of
Theorem 2.2.



Remark 6.1 Let us indicate now how to handle the case when $Q$ is degenerate.
Let us consider the linear subspace $S$ of $\mathbb{R}^d$ spanned by the eigenvectors of $Q$ corresponding to positive eigenvalues. Then almost surely $\mathrm{span}\{X_i, i = 1, \dots, n\} \subset S$.
Indeed for any $\theta$ in the kernel of $Q$, $\mathbb{E}(\langle\theta, X\rangle^2) = 0$ implies that $\langle\theta, X\rangle = 0$
almost surely, and considering a basis of the kernel, we see that $X \in S$ almost
surely, $S$ being orthogonal to the kernel of $Q$. Thus we can restrict the problem to
$S$, as soon as we choose

\[ \hat\theta \in \mathrm{span}\{X_1, \dots, X_n\} \cap \arg\min_\theta\sum_{i=1}^n\big(\langle\theta, X_i\rangle - Y_i\big)^2, \]

or equivalently with the notation $\mathbf{X} = (\varphi_j(X_i))_{1\le i\le n,\,1\le j\le d}$ and $\mathbf{Y} = [Y_j]_{j=1}^n$,

\[ \hat\theta \in \mathrm{im}\,\mathbf{X}^T \cap \arg\min_\theta\|\mathbf{X}\theta - \mathbf{Y}\|^2. \]

This proves that the results of this section apply to this special choice of the empirical least squares estimator. Since we have $\mathbb{R}^d = \ker\mathbf{X} \oplus \mathrm{im}\,\mathbf{X}^T$, this choice is
unique.
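One simple way to land on this particular least squares solution in practice is to run gradient descent started at $\theta = 0$: every gradient $\mathbf{X}^T(\mathbf{X}\theta - \mathbf{Y})$ lies in $\mathrm{im}\,\mathbf{X}^T$, so the iterates never leave that subspace. The Python sketch below illustrates this on a rank-deficient toy design; gradient descent is our illustrative choice here, not a procedure described in the text, and the data, step size and iteration count are arbitrary.

```python
def min_norm_least_squares(X, Y, step, iterations):
    """Gradient descent on 0.5 * ||X theta - Y||^2 started at theta = 0."""
    d = len(X[0])
    theta = [0.0] * d  # starting at 0 keeps every iterate in im(X^T)
    for _ in range(iterations):
        residuals = [sum(xj * tj for xj, tj in zip(row, theta)) - y
                     for row, y in zip(X, Y)]
        grad = [sum(row[j] * r for row, r in zip(X, residuals)) for j in range(d)]
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta

# Degenerate design: both columns are equal, so every theta with
# theta[0] + theta[1] = 2 minimizes the empirical risk.
X = [[1.0, 1.0], [2.0, 2.0]]
Y = [2.0, 4.0]
theta_hat = min_norm_least_squares(X, Y, step=0.05, iterations=200)
```

Among all minimizers, the iteration converges to $(1, 1)$, the one lying in the span of the rows of $\mathbf{X}$. The step size must stay below $2$ divided by the largest eigenvalue of $\mathbf{X}^T\mathbf{X}$ (here $10$).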
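6.3. Proof of Theorem 3.1. We use a similar notation as in Section 6.2:
we write $X$ for $\varphi(X)$. Therefore, the function $f_\theta$ maps an input $x$ to $\langle\theta, x\rangle$. We
consider the change of coordinates

\[ \bar X = Q_\lambda^{-1/2}X. \]

Thus, from (6.19), we have $\mathbb{E}[\|\bar X\|^2] = D$. We will use

\[ \bar R(\theta) = \mathbb{E}\big[(\langle\theta,\bar X\rangle - Y)^2\big], \]

so that $\bar R(Q_\lambda^{1/2}\theta) = \mathbb{E}\big[(\langle\theta, X\rangle - Y)^2\big] = R(f_\theta)$. Let

\[ \bar\Theta = \{Q_\lambda^{1/2}\theta;\ \theta\in\Theta\}. \]

Consider

\[ \theta_0 = \arg\min_{\theta\in\bar\Theta}\Big\{\bar R(\theta) + \lambda\|Q_\lambda^{-1/2}\theta\|^2\Big\}. \]

We thus have $\tilde\theta = Q_\lambda^{-1/2}\theta_0$, and

\begin{align*}
\sigma &= \sqrt{\mathbb{E}\big[(\langle\theta_0,\bar X\rangle - Y)^2\big]},\\
\chi &= \sup_{u\in\mathbb{R}^d}\frac{\mathbb{E}\big(\langle u,\bar X\rangle^4\big)^{1/2}}{\mathbb{E}\big(\langle u,\bar X\rangle^2\big)},\\
\kappa &= \frac{\mathbb{E}\big(\|\bar X\|^4\big)^{1/2}}{\mathbb{E}\big(\|\bar X\|^2\big)} = \frac{\mathbb{E}\big(\|\bar X\|^4\big)^{1/2}}{D},\\
\kappa' &= \frac{\mathbb{E}\big[(\langle\theta_0,\bar X\rangle - Y)^4\big]^{1/2}}{\sigma^2},\\
T &= \|\bar\Theta\| = \max_{\theta,\theta'\in\bar\Theta}\|\theta - \theta'\|.
\end{align*}

For $\alpha > 0$, we introduce

\begin{align*}
J_i(\theta) &= \langle\theta,\bar X_i\rangle - Y_i, & J(\theta) &= \langle\theta,\bar X\rangle - Y,\\
L_i(\theta) &= \alpha\big(\langle\theta,\bar X_i\rangle - Y_i\big)^2, & L(\theta) &= \alpha\big(\langle\theta,\bar X\rangle - Y\big)^2,\\
W_i(\theta) &= L_i(\theta) - L_i(\theta_0), & W(\theta) &= L(\theta) - L(\theta_0),
\end{align*}

and

\[ r'(\theta,\theta') = \lambda\big(\|Q_\lambda^{-1/2}\theta\|^2 - \|Q_\lambda^{-1/2}\theta'\|^2\big) + \frac{1}{n\alpha}\sum_{i=1}^n\psi\big(L_i(\theta) - L_i(\theta')\big). \]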
Let $\check\theta = Q_\lambda^{1/2}\hat\theta \in \bar\Theta$. We have

\[ -r'(\theta_0,\check\theta) = r'(\check\theta,\theta_0) \le \max_{\theta_1\in\bar\Theta}r'(\check\theta,\theta_1) \le \gamma + \max_{\theta_1\in\bar\Theta}r'(\theta_0,\theta_1), \tag{6.25} \]

where $\gamma = \max_{\theta_1\in\bar\Theta}r'(\check\theta,\theta_1) - \inf_{\theta\in\bar\Theta}\max_{\theta_1\in\bar\Theta}r'(\theta,\theta_1)$ is a quantity which can be made
arbitrarily small by choice of the estimator. By using an upper bound on $r'(\theta_0,\theta_1)$ that
holds uniformly in $\theta_1$, we will control both left and right hand sides of (6.25).
To achieve this, we will upper bound

\[ r'(\theta_0,\theta_1) = \lambda\big(\|Q_\lambda^{-1/2}\theta_0\|^2 - \|Q_\lambda^{-1/2}\theta_1\|^2\big) + \frac{1}{n\alpha}\sum_{i=1}^n\psi\big[-W_i(\theta_1)\big] \tag{6.26} \]

by the expectation, with respect to a distribution depending on $\theta_1$, of a quantity that does not
depend on $\theta_1$, and then use the PAC-Bayesian argument to control this expectation
uniformly in $\theta_1$. The distribution depending on $\theta_1$ should therefore be taken such
that for any $\theta_1 \in \bar\Theta$, its Kullback-Leibler divergence with respect to some fixed
distribution is small (at least when $\theta_1$ is close to $\theta_0$).
Let us start with the following result.

Lemma 6.7 Let $f, g : \mathbb{R} \to \mathbb{R}$ be two Lebesgue measurable functions such that
$f(x) \le g(x)$, $x \in \mathbb{R}$. Let us assume that there exists $h \in \mathbb{R}$ such that $x \mapsto g(x) + h\frac{x^2}{2}$ is convex. Then for any probability distribution $\mu$ on the real line,

\[ f\Big(\int x\,\mu(dx)\Big) \le \int g(x)\,\mu(dx) + \min\Big\{\sup f - \inf f,\ \frac{h}{2}\mathrm{Var}(\mu)\Big\}. \]

Proof. Let us put $x_0 = \int x\,\mu(dx)$. The function

\[ x \mapsto g(x) + \frac{h}{2}(x - x_0)^2 \]

is convex. Thus, by Jensen's inequality

\[ f(x_0) \le g(x_0) \le \int\mu(dx)\Big[g(x) + \frac{h}{2}(x - x_0)^2\Big] = \int g(x)\,\mu(dx) + \frac{h}{2}\mathrm{Var}(\mu). \]

On the other hand

\[ f(x_0) \le \sup f \le \sup f + \int\big[g(x) - \inf f\big]\mu(dx) = \int g(x)\,\mu(dx) + \sup f - \inf f. \]

The lemma is a combination of these two inequalities. □

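The above lemma will be used with $f = g = \psi$, where $\psi$ is the increasing
influence function

\[ \psi(x) = \begin{cases}
-\log(2), & x \le -1,\\
\log(1 + x + x^2/2), & -1 \le x \le 0,\\
-\log(1 - x + x^2/2), & 0 \le x \le 1,\\
\log(2), & x \ge 1.
\end{cases} \]

Since we have for any $x \in \mathbb{R}$

\[ -\log\Big(1 - x + \frac{x^2}{2}\Big) = \log\Big(\frac{1 + x + \frac{x^2}{2}}{1 + \frac{x^4}{4}}\Big) \le \log\Big(1 + x + \frac{x^2}{2}\Big), \]

the function $\psi$ satisfies for any $x \in \mathbb{R}$

\[ -\log\Big(1 - x + \frac{x^2}{2}\Big) \le \psi(x) \le \log\Big(1 + x + \frac{x^2}{2}\Big). \]

Moreover

\[ \psi'(x) = \frac{1 - x}{1 - x + \frac{x^2}{2}}, \qquad \psi''(x) = \frac{x(x - 2)}{2\big(1 - x + \frac{x^2}{2}\big)^2} \ge -2, \qquad 0 \le x \le 1, \]

showing (by symmetry) that the function $x \mapsto \psi(x) + 2x^2$ is convex on the real
line.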
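The influence function and its two-sided logarithmic bounds are straightforward to code. The Python sketch below (the names `psi`, `lower` and `upper` are ours) implements the four-piece definition and evaluates the sandwich inequality on a grid of test points.

```python
import math

def psi(x):
    """Increasing influence function, clipped at -log(2) and log(2)."""
    if x <= -1.0:
        return -math.log(2.0)
    if x <= 0.0:
        return math.log(1.0 + x + 0.5 * x * x)
    if x <= 1.0:
        return -math.log(1.0 - x + 0.5 * x * x)
    return math.log(2.0)

def lower(x):
    return -math.log(1.0 - x + 0.5 * x * x)

def upper(x):
    return math.log(1.0 + x + 0.5 * x * x)

grid = [k / 10.0 for k in range(-50, 51)]   # points -5.0, -4.9, ..., 5.0
sandwich_ok = all(lower(x) - 1e-12 <= psi(x) <= upper(x) + 1e-12 for x in grid)
```

The function is odd, continuous at $x = \pm 1$ (both polynomials equal $1/2$ there), and bounded by $\log 2$ in absolute value, which is what makes sums of $\psi$-values insensitive to extreme observations.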

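For any $\theta' \in \mathbb{R}^d$ and $\beta > 0$, we consider the Gaussian distribution with mean $\theta'$
and covariance $\beta^{-1}I$:

\[ \rho_{\theta'}(d\theta) = \Big(\frac{\beta}{2\pi}\Big)^{d/2}\exp\Big(-\frac{\beta}{2}\|\theta - \theta'\|^2\Big)\,d\theta. \]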



From Lemmas 16.21 and 1677] (with fi the distribution oi — Wi{9) 
9 is drawn from pe^ and for a fixed pair (Xj, 1^^)), we can see that 



when 



i^[-W,{9{)\=i^}^jpeM0) 



-W,{9) + 
-Wi{9) + 



a\\X,P 



|2n 



/3 



+ min<^log(4), Vaip,^ [Li 
Let us compute



+ — — ■ ^^-^^^ 



Let ^ e (0, 1). Now we can remark that 

< — + l-J ■ 

We get 



< 



.n{log(4),Var,,^ [L,(^)] } 

= min|log(4), ^ 
J pei(c^^)niin|log(4), 



"I" "I" 



/3e I' _ ^(1-0 

< / pe,{dO) min|log(4), " ^' ' + ^} 



+ mm{log(4),^J-|}- 
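Let us now put $a = \frac{3}{\log(4)} < 2.17$, $b = a + a^2\log(4) < 8.7$ and let us remark that

\begin{align*}
\min\{\log(4), x\} + \min\{\log(4), y\} &\le \log\big[1 + a\min\{\log(4), x\}\big] + \log(1 + ay)\\
&\le \log(1 + ax + by), \qquad x, y \in \mathbb{R}_+.
\end{align*}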



Thus 



mm 



{log(4),Var,,^ [Lm]] 
We can then remark that

\begin{align*}
\psi(x) + \log(1 + y) &= \log\big[\exp[\psi(x)] + y\exp[\psi(x)]\big]\\
&\le \log\big[\exp[\psi(x)] + 2y\big] \le \log\Big(1 + x + \frac{x^2}{2} + 2y\Big),
\end{align*}

Thus, putting Co = a + - — -, we get 
with 



X eR,y eR^ 



(6.28) 



8aa\\X,fL,{9) W||X,f 

Similarly, we define y4(^^) by replacing (Xj, Fj) by (X, Y). Since we have 

Eexp ('^log[A,(^)] -nlog[EA(^)]') = 1, 

from the usual PAC-Bayesian argument, we have with probability at least 1 — e, 
for any 6'i e R"*, 

/ PeMQ){Y^og\A,{Q)\\j -n j peAde)\og[A{d)] < K{pe,, pe,) + \og{e-') 

< ^ + log(^ ) 

From (16.261) and (|6.28l) . with probability at least I — e, for any 6i G R.'^, we get 



r'(^o,^i) < -logh + E 



a 



peAde)\-w{e) + 



laa\\XfL{e) 4coa2||Xf 



Pi 
2na



na 



Now from Km and = -L{ei) + / pe,{M)L{e), we have 



p,, (d^^) ( -W{e) + = Var,,^ \L{e)] + w{e,f 



/3 



(3 



Proposition 6.8 With probability at least 1 - e,for any 9i e R'^, 



r'{eo,ei) < -logh + E 



a 



+ 



(l + 8a/e + 4co)a'||X||^ 



/3 



2na 



^2 



+ 



< E 



(1 + 8a/e + 4co)a||Xr ] ^ - gpf ^ log(g-^) 



(3' 



2na 



na 



+K\\Qx'^\r-\\Qx'^\r)- 

By using the triangular inequality and Cauchy-Scwarz's inequality, we get 

^E[Wi9iY] = e| [(^1 - 9o,Xf + 2(^1 - 9o,X)J{9o)Y} 

< {e[{9, - 9o,X)f^' + 2E[(0i - 9o,X)f^'HJ{0orY^'y 

9i — 9q 



11^1-^0 

+ 2\\9,-94a^jK'x\\E 



< 



1 — ^0] 



\\9i-9o 



Qmax ~l~ ^ 
and



^E[\\xrL{e,)] = E\[\\x\\{e,-eo,x) + ||x||j(^^o)]^ 



a 



I V ^max + A J 



Let us put 

R[e) = m + \\\Qi^'^e\\\ 

Ci=4(2 + 8a/0, 

C2 =4(l + 8a/^ + 4co), 

n 

We have proved the following result. 
Proposition 6.9 With probability at least 1 - e, for any 9i E R'^, 

+ —KD[VK'cr+\\9i-9o\\y/x\ + 



2 



^ /3||gi-gof ^ log(£-i) 



2na na 

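Let us assume from now on that $\theta_1 \in \bar\Theta$, our convex bounded parameter set. In
this case, as seen in (6.20), we have $\|\theta_0 - \theta_1\|^2 \le \tilde R(\theta_1) - \tilde R(\theta_0)$. We can also use
the fact that

\[ \big[\sqrt{\kappa'}\sigma + \|\theta_1 - \theta_0\|\sqrt{\chi}\big]^2 \le 2\kappa'\sigma^2 + 2\chi\|\theta_1 - \theta_0\|^2. \]

We deduce from these remarks that with probability at least $1 - \epsilon$,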



2(3 4/32 na 

Let us assume that n > AcynxD and let us choose 

na 

^ 2 ' 
^ - I ^ 4:CinxD

"~ 2x[2v^a+||e||v^]' 



n 



to get 



Plugging this into (|6.25|) . we get 

\ -S< r'{e, 9o) < max ^ \ ] + + 5 = + 5, 

2 eise V ^ / 

hence 

R{e) - ^(^o) < 27 + 46. 

Computing the numerical values of the constants when $\xi = 0.8$ gives $c_1 < 95$ and
$c_2 < 1511$.

6.4. Proof of Theorem 5.1. We use the standard way of obtaining PAC
bounds through upper bounds on Laplace transform of appropriate random variables. This argument is summarized in the following result.

Lemma 6.10 For any $\epsilon > 0$ and any real-valued random variable $V$ such that
$\mathbb{E}[\exp(V)] \le 1$, with probability at least $1 - \epsilon$, we have

\[ V \le \log(\epsilon^{-1}). \]
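For the reader's convenience, Lemma 6.10 is a one-line consequence of Markov's inequality applied to $\exp(V)$:

```latex
\begin{align*}
\mathbb{P}\big(V > \log(\epsilon^{-1})\big)
  = \mathbb{P}\big(\exp(V) > \epsilon^{-1}\big)
  \le \epsilon\,\mathbb{E}[\exp(V)]
  \le \epsilon .
\end{align*}
```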



Letl^i(/) = j [L\fJ)+^*R{f)]rr*_^.^{df)-^R{f) 

- r(7*) + %) + log (/ exp[-£(/)]7r(rf/)^ - log 

andV^2 = -log(^y" exp[-£(/)]7r(rf/)^ + ( / ^M'^^ifM^f) 
To prove the theorem, according to Lemma [6. 101 it suffices to prove that 

E{/exp[Vl(/)]p(rf/)}<l and E [/ exp(V^2)p(rf/) 
These two inequalities are proved in the following two sections. 



< 1. 
6.4.1. Proof of $\mathbb{E}\big\{\int\exp[V_1(\hat f)]\hat\rho(d\hat f)\big\} \le 1$. From Jensen's inequality, we have

[L\f,f)+fR{f)\n*_^,^{df) 

= I [L{fJ) + l*mKrRidf) + I [L\f,f)-L{f,f)]7r%.^{df) 

From Jensen's inequality again, 
£(/) = -log J exp[L(/,/)]7r*(d/) 

= -logy" exp[L(/,/)+7*i?(/)]7rl^,^(ci/)-log j exp [-7*i?(/)]7r*(ci/) 

From the two previous inequalities, we get 

Vi{f)< J [LCfJ) + i*R{f)V-rRm 

+ log j exp[L\fJ)-L{f,f)y{df)-^R{f) 
-J*(7*)+J(7)+log(^y" exp[-£(/)]7r(rf/)^ -log 
= / [Lif,f)+j*Rif)]7r%.^idf) 

+ log j exp[L\fJ)-L{fJ)]n*{df)-^R{f) 



dp 
dir 



if) 



-J*(7*) + J(7) -£(/)- log 
<log J exp[L\fJ) - L{fJ)]7r*_^,fi{df){df) 

-7^(/)+%)-log 
= log j exp[L\f, f) - L{f, f)y_^.n{df) + log 
hence, by using Fubini's inequality and the equality 



d7r_ 



■yR 



dp 



if) 
E{exp[-L(/,/)]} =exp[-L^(/,/)],
we obtain E j exp[V^{f)]p{f) 

<eJ ^exp[L\fJ)-L{fJ)]n*^^,^{df)y_^ji{df) 



1. 



6.4.2. Proof of $\mathbb{E}\big[\int\exp(V_2)\hat\rho(d\hat f)\big] \le 1$. It relies on the following result.



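Lemma 6.11 Let $W$ be a real-valued measurable function defined on a product
space $\mathcal{A}_1\times\mathcal{A}_2$ and let $\mu_1$ and $\mu_2$ be probability distributions on respectively $\mathcal{A}_1$
and $\mathcal{A}_2$.

• If $\mathbb{E}_{a_1\sim\mu_1}\big|\log\mathbb{E}_{a_2\sim\mu_2}\{\exp[-W(a_1,a_2)]\}\big| < +\infty$, then we have

\[ -\mathbb{E}_{a_1\sim\mu_1}\Big\{\log\mathbb{E}_{a_2\sim\mu_2}\big\{\exp[-W(a_1,a_2)]\big\}\Big\} \le -\log\Big\{\mathbb{E}_{a_2\sim\mu_2}\big\{\exp\big[-\mathbb{E}_{a_1\sim\mu_1}W(a_1,a_2)\big]\big\}\Big\}. \]

• If $W > 0$ on $\mathcal{A}_1\times\mathcal{A}_2$ and $\mathbb{E}_{a_1\sim\mu_1}\big\{\mathbb{E}_{a_2\sim\mu_2}[W(a_1,a_2)^{-1}]^{-1}\big\} < +\infty$, then

\[ \mathbb{E}_{a_1\sim\mu_1}\Big\{\mathbb{E}_{a_2\sim\mu_2}\big[W(a_1,a_2)^{-1}\big]^{-1}\Big\} \le \mathbb{E}_{a_2\sim\mu_2}\Big\{\mathbb{E}_{a_1\sim\mu_1}\big[W(a_1,a_2)\big]^{-1}\Big\}^{-1}. \]

Proof.

• Let $\mathcal{A}$ be a measurable space and $\mathcal{M}$ denote the set of probability distributions on $\mathcal{A}$. The Kullback-Leibler divergence between a distribution $\rho$ and
a distribution $\mu$ is

\[ K(\rho,\mu) = \begin{cases}
\mathbb{E}_{a\sim\rho}\log\Big[\dfrac{d\rho}{d\mu}(a)\Big] & \text{if } \rho \ll \mu,\\[1ex]
+\infty & \text{otherwise},
\end{cases} \tag{6.29} \]

where $\frac{d\rho}{d\mu}$ denotes as usual the density of $\rho$ w.r.t. $\mu$. The Kullback-Leibler
divergence satisfies the duality formula (see, e.g., [8, page 159]): for any
real-valued measurable function $h$ defined on $\mathcal{A}$,

\[ \inf_{\rho\in\mathcal{M}}\big\{\mathbb{E}_{a\sim\rho}h(a) + K(\rho,\mu)\big\} = -\log\mathbb{E}_{a\sim\mu}\big\{\exp[-h(a)]\big\}. \tag{6.30} \]

By using twice (6.30) and Fubini's theorem, we have

\begin{align*}
-\mathbb{E}_{a_1\sim\mu_1}\Big\{\log\mathbb{E}_{a_2\sim\mu_2}\big\{\exp[-W(a_1,a_2)]\big\}\Big\}
&= \mathbb{E}_{a_1\sim\mu_1}\Big\{\inf_\rho\big\{\mathbb{E}_{a_2\sim\rho}[W(a_1,a_2)] + K(\rho,\mu_2)\big\}\Big\}\\
&\le \inf_\rho\Big\{\mathbb{E}_{a_1\sim\mu_1}\big[\mathbb{E}_{a_2\sim\rho}[W(a_1,a_2)]\big] + K(\rho,\mu_2)\Big\}\\
&= -\log\Big\{\mathbb{E}_{a_2\sim\mu_2}\big\{\exp\big\{-\mathbb{E}_{a_1\sim\mu_1}[W(a_1,a_2)]\big\}\big\}\Big\}.
\end{align*}

• By using twice (6.30) and the first assertion of Lemma 6.11, we have

\begin{align*}
\mathbb{E}_{a_1\sim\mu_1}\Big\{\mathbb{E}_{a_2\sim\mu_2}\big[W(a_1,a_2)^{-1}\big]^{-1}\Big\}
&= \mathbb{E}_{a_1\sim\mu_1}\Big\{\exp\Big\{-\log\mathbb{E}_{a_2\sim\mu_2}\big\{\exp[-\log W(a_1,a_2)]\big\}\Big\}\Big\}\\
&= \mathbb{E}_{a_1\sim\mu_1}\Big\{\exp\Big\{\inf_\rho\big\{\mathbb{E}_{a_2\sim\rho}\{\log[W(a_1,a_2)]\} + K(\rho,\mu_2)\big\}\Big\}\Big\}\\
&\le \inf_\rho\Big\{\exp[K(\rho,\mu_2)]\,\mathbb{E}_{a_1\sim\mu_1}\Big\{\exp\big\{\mathbb{E}_{a_2\sim\rho}\log[W(a_1,a_2)]\big\}\Big\}\Big\}\\
&\le \inf_\rho\Big\{\exp[K(\rho,\mu_2)]\,\exp\Big\{\mathbb{E}_{a_2\sim\rho}\log\big\{\mathbb{E}_{a_1\sim\mu_1}[W(a_1,a_2)]\big\}\Big\}\Big\}\\
&= \exp\Big\{\inf_\rho\Big\{\mathbb{E}_{a_2\sim\rho}\log\big\{\mathbb{E}_{a_1\sim\mu_1}[W(a_1,a_2)]\big\} + K(\rho,\mu_2)\Big\}\Big\}\\
&= \exp\Big\{-\log\Big\{\mathbb{E}_{a_2\sim\mu_2}\Big\{\exp\Big[-\log\big\{\mathbb{E}_{a_1\sim\mu_1}[W(a_1,a_2)]\big\}\Big]\Big\}\Big\}\Big\}\\
&= \mathbb{E}_{a_2\sim\mu_2}\Big\{\mathbb{E}_{a_1\sim\mu_1}\big[W(a_1,a_2)\big]^{-1}\Big\}^{-1}. \qquad □
\end{align*}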
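The second assertion of Lemma 6.11 says that averaging harmonic-type means is dominated by the harmonic-type mean of averages. On finite spaces it can be checked directly. The Python sketch below does so for a hypothetical positive $2\times 2$ table `W` (rows indexed by $a_1$, columns by $a_2$) and arbitrary weights `mu1`, `mu2`; all names and numbers are illustrative.

```python
def harmonic_side(W, mu1, mu2):
    """E_{a1} [ ( E_{a2} [ 1 / W(a1, a2) ] )^{-1} ] on finite spaces."""
    return sum(m1 / sum(m2 / w for m2, w in zip(mu2, row))
               for m1, row in zip(mu1, W))

def averaged_side(W, mu1, mu2):
    """( E_{a2} [ 1 / E_{a1} W(a1, a2) ] )^{-1} on finite spaces."""
    column_means = [sum(m1 * row[j] for m1, row in zip(mu1, W))
                    for j in range(len(mu2))]
    return 1.0 / sum(m2 / c for m2, c in zip(mu2, column_means))

W = [[1.0, 4.0], [9.0, 2.0]]
mu1 = [0.5, 0.5]
mu2 = [0.3, 0.7]
lhs = harmonic_side(W, mu1, mu2)
rhs = averaged_side(W, mu1, mu2)

# When W does not depend on a1, both sides coincide.
W_flat = [[1.0, 4.0], [1.0, 4.0]]
```

With these numbers the left-hand side is about 2.36 and the right-hand side about 3.41, in line with the lemma.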

From Lemma [6.1 H and Fubini's theorem, since V2 does not depend on /, we 
have 

E 



/ exp{V2)p{df)\ =E[exp{V2 

= J exp [-£«(/)]7r(rf/)E| [/ exp [-£(/)] 7r(rf/) 
< / exp [-£«(/)] 7r(rf/) {/E [exp [£(/)] ] '\{df) 
= J exp [-£«(/)] vr(rf/) |/E [ / exp [L(/, /')] tt* (df 



-1 



-1 >, -1 
n{df) 



/exp[-£«(/)]7r(rf/) / Jexp[L^{f,f)y{df) Tr{df) 



-1 



"1 



1. 
This concludes the proof that for any $\gamma > 0$, $\gamma^* > 0$ and $\epsilon > 0$, with probability
(with respect to the distribution $P^{\otimes n}\hat\rho$ generating the observations $Z_1, \dots, Z_n$ and
the randomized prediction function $\hat f$) at least $1 - 2\epsilon$:

\[ V_1(\hat f) + V_2 \le 2\log(\epsilon^{-1}). \]

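6.5. Proof of Lemma 5.3. Let us look at $\mathcal{F}$ from the point of view of $f^*$.
Precisely let $S_{\mathbb{R}^d}(O, 1)$ be the sphere of $\mathbb{R}^d$ centered at the origin and with radius
1 and

\[ \mathbb{S} = \Big\{\sum_{j=1}^d\theta_j\varphi_j;\ (\theta_1, \dots, \theta_d) \in S_{\mathbb{R}^d}(O, 1)\Big\}. \]

Introduce

\[ \Omega = \{\phi\in\mathbb{S};\ \exists u > 0 \text{ s.t. } f^* + u\phi \in \mathcal{F}\}. \]

For any $\phi \in \Omega$, let $u_\phi = \sup\{u > 0 : f^* + u\phi \in \mathcal{F}\}$. Since $\pi$ is the uniform
distribution on the convex set $\mathcal{F}$ (i.e., the one coming from the uniform distribution
on $\Theta$), we have

\[ \int\exp\{-\alpha[R(f) - R(f^*)]\}\pi(df) \propto \int_{\phi\in\Omega}\int_0^{u_\phi}\exp\{-\alpha[R(f^* + u\phi) - R(f^*)]\}u^{d-1}\,du\,d\phi. \]

Let $c_\phi = \mathbb{E}[\phi(X)\ell'_Y(f^*(X))]$ and $a_\phi = \mathbb{E}[\phi^2(X)]$. Since

\[ f^* \in \arg\min_{f\in\mathcal{F}}\mathbb{E}\{\ell_Y[f(X)]\}, \]

we have $c_\phi \ge 0$ (and $c_\phi = 0$ if both $-\phi$ and $\phi$ belong to $\Omega$). Moreover from
Taylor's expansion,

\[ \frac{b_1a_\phi u^2}{2} \le R(f^* + u\phi) - R(f^*) - uc_\phi \le \frac{b_2a_\phi u^2}{2}. \]

Introduce

\[ \Psi_\phi = \frac{\int_0^{u_\phi}\exp\{-\alpha[uc_\phi + \frac{1}{2}b_1a_\phi u^2]\}u^{d-1}\,du}{\int_0^{u_\phi}\exp\{-\beta[uc_\phi + \frac{1}{2}b_2a_\phi u^2]\}u^{d-1}\,du}. \]

For any $0 < \alpha < \beta$, we have

\[ \frac{\int\exp\{-\alpha[R(f) - R(f^*)]\}\pi(df)}{\int\exp\{-\beta[R(f) - R(f^*)]\}\pi(df)} \le \sup_{\phi\in\Omega}\Psi_\phi. \]

For any $\zeta \ge 1$, by a change of variable,

\begin{align*}
\Psi_\phi &= \zeta^d\,\frac{\int_0^{u_\phi/\zeta}\exp\{-\alpha[\zeta uc_\phi + \frac{1}{2}b_1a_\phi\zeta^2u^2]\}u^{d-1}\,du}{\int_0^{u_\phi}\exp\{-\beta[uc_\phi + \frac{1}{2}b_2a_\phi u^2]\}u^{d-1}\,du}\\
&\le \zeta^d\sup_{u>0}\exp\Big\{\beta\big[uc_\phi + \tfrac{1}{2}b_2a_\phi u^2\big] - \alpha\big[\zeta uc_\phi + \tfrac{1}{2}b_1a_\phi\zeta^2u^2\big]\Big\}.
\end{align*}

By taking $\zeta = \sqrt{(b_2\beta)/(b_1\alpha)}$ when $\sup_{\phi\in\Omega}c_\phi = 0$ and $\zeta = \sqrt{(b_2\beta)/(b_1\alpha)} \vee (\beta/\alpha)$
otherwise, we obtain $\Psi_\phi \le \zeta^d$, hence

\[ \log\left(\frac{\int\exp\{-\alpha[R(f) - R(f^*)]\}\pi(df)}{\int\exp\{-\beta[R(f) - R(f^*)]\}\pi(df)}\right) \le
\begin{cases}
\dfrac{d}{2}\log\Big(\dfrac{b_2\beta}{b_1\alpha}\Big) & \text{when } \sup_{\phi\in\Omega}c_\phi = 0,\\[2ex]
d\log\Big(\sqrt{\dfrac{b_2\beta}{b_1\alpha}} \vee \dfrac{\beta}{\alpha}\Big) & \text{otherwise},
\end{cases} \]

which proves the announced result.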



6.6. Proof of Lemma [531 For -{2AHY^ < A < {2AH)-^, introduce the 
random variables 

F = f{X) F* = riX), 

n = e'yiF*) + {F - F*) [ {l-t)i'^{F* +t{F ~F*))dt, 

Jo 

L = X[i{Y, F) - i{Y, F*)], 

and the quantities 

M^A'^exp{Hb2/A) 
^ " 20F(1 - \\\AH) 

and 

i = iJ62/2 + Alog(M) = ^log{M2exp[i762/(2A)]}. 
From Taylor-Lagrange formula, we have 

L = \{F-F*)n. 
SinceE[exp(|^]|/A) | X] < Mexp [1/62/(2^)] , Lemma|Djgives 

M^a^exp{Hb2/A) 



\og< E 



exp{a[n-'E{n\X)]/A}\X 



< 



2^(1 -\a\) 



for any — 1 < a < 1, and 

\E{n\X)\<A. (6.31) 
By considering $a = A\lambda[f(x) - f^*(x)] \in [-1/2; 1/2]$ for fixed $x \in \mathcal{X}$, we get

logjE exp[L-E(L|X)] \X | < X'^{F - F*ya{X). (6.32) 



Let us put moreover 



L = E{L\X) + a{X)X\F - F*y. 



Since -{2AH)-^ < X< i2AH)-\ we have L < \X\HA + a{X)X^H^ < h' with 
h' = A/{2A) + M'^exp{Hb2/A)/{Ay/l^). Since L - E(L) = L- E{L\X) + 
E(L|X) - E(L), by using Lemma|DH (l632l) and (lOTl) . we obtain 



log<^E 



exp [L - E(L)] 



< loK<^ E 



exp[L - E(L)]J I + X\{X)E[{F - F*f] 

< E{L^)g{b') + X^a{X)E[{F - F*)^] 
<X'E[{F-F*f] [A'g{b') + a{X)], 



with g{u) = [exp(M) — 1 — m] /m^. Computations show that for any —{2 AH) ^ < 

X < {2AH)-\ 

A^g{b') + a(A) < — exp [m^ exp {Hb2/A) . 

Consequently, for any -{2AHy^ <X< {2AHy^, we have 

log{E[exp{A[£(r,F) 

< X[R{f)-R{f*)] + X^E[{F-F*)^] — exp^M^exp{Hb2/A) . 

Now it remains to notice that $\mathbb{E}[(F - F^*)^2] \le 2[R(f) - R(f^*)]/b_1$. Indeed
consider the function $\phi(t) = R(f^* + t(f - f^*)) - R(f^*)$, where $f \in \mathcal{F}$ and
$t \in [0; 1]$. From the definition of $f^*$ and the convexity of $\mathcal{F}$, we have $\phi \ge 0$ on
$[0; 1]$. Besides we have $\phi(t) = \phi(0) + t\phi'(0) + \frac{t^2}{2}\phi''(\zeta_t)$ for some $\zeta_t \in ]0; 1[$. So we
have $\phi'(0) \ge 0$, and using the lower bound on the convexity, we obtain for $t = 1$

\[ \frac{b_1}{2}\, \mathbb{E}(F - F^*)^2 \le R(f) - R(f^*). \tag{6.33} \]

6.7. Proof of Lemma 5.6. We have

\[ \mathbb{E}\big(\{[Y - f(X)]^2 - [Y - f^*(X)]^2\}^2\big) \]
\[ = \mathbb{E}\big([f^*(X) - f(X)]^2 \{2[Y - f^*(X)] + [f^*(X) - f(X)]\}^2\big) \]
\[ = \mathbb{E}\big([f^*(X) - f(X)]^2 \{4\,\mathbb{E}([Y - f^*(X)]^2 \mid X) + 4\,\mathbb{E}(Y - f^*(X) \mid X)[f^*(X) - f(X)] + [f^*(X) - f(X)]^2\}\big) \]
\[ \le \mathbb{E}\big([f^*(X) - f(X)]^2 \{4\sigma^2 + 4\sigma |f^*(X) - f(X)| + [f^*(X) - f(X)]^2\}\big) \]
\[ \le \mathbb{E}\big([f^*(X) - f(X)]^2 (2\sigma + H)^2\big) \]
\[ \le (2\sigma + H)^2 [R(f) - R(f^*)], \]

where the last inequality is the usual relation between excess risk and distance
using the convexity of $\mathcal{F}$ (see above (6.33) for a proof).
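The chain of inequalities in the proof of Lemma 5.6 can be checked numerically on a toy example. The sketch below is only an illustration under assumptions that are not in the text: $X$ takes finitely many values, and $Y = f^*(X) \pm s(X)$ with equal probability, so that $\mathbb{E}(Y - f^*(X) \mid X) = 0$ and the conditional variance is $s(X)^2 \le \sigma^2$. The function names are hypothetical. It compares the second moment of the loss difference with $(2\sigma + H)^2\, \mathbb{E}\{[f^*(X) - f(X)]^2\}$, i.e., the bound before the final excess-risk step.

```python
def second_moment_of_loss_difference(px, delta, s):
    """E{([Y - f(X)]^2 - [Y - f*(X)]^2)^2} when Y = f*(X) +/- s(X) with equal odds.

    px[i]    : probability of the i-th value of X (toy, finitely supported X)
    delta[i] : f*(x_i) - f(x_i)
    s[i]     : conditional standard deviation of Y given X = x_i
    """
    total = 0.0
    for p, d, sx in zip(px, delta, s):
        for noise in (sx, -sx):
            # [Y - f]^2 - [Y - f*]^2 = (f* - f) * (2 (Y - f*) + (f* - f))
            total += 0.5 * p * (d * (2.0 * noise + d)) ** 2
    return total


def bound(px, delta, sigma, h):
    """(2 sigma + H)^2 * E{[f*(X) - f(X)]^2}."""
    return (2.0 * sigma + h) ** 2 * sum(p * d * d for p, d in zip(px, delta))


# Hypothetical toy values: |delta| <= H = 1 and s <= sigma = 1.
px, delta, s = [0.3, 0.7], [0.5, -1.0], [1.0, 0.4]
lhs = second_moment_of_loss_difference(px, delta, s)
rhs = bound(px, delta, sigma=1.0, h=1.0)
```

With these values, `lhs` stays below `rhs`, as the proof predicts whenever $|f^* - f| \le H$ and the conditional variance is at most $\sigma^2$.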

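6.8. Proof of Lemma 5.7. Let $S = \{s \in \mathcal{F}_{\mathrm{lin}} : \mathbb{E}[s(X)^2] = 1\}$. Using the
triangular inequality in $L^2$, we get

\[ \mathbb{E}\big(\{[Y - f(X)]^2 - [Y - f^*(X)]^2\}^2\big) \]
\[ = \mathbb{E}\big(\{2[f^*(X) - f(X)][Y - f^*(X)] + [f^*(X) - f(X)]^2\}^2\big) \]
\[ \le \Big(2\sqrt{\mathbb{E}\{[f^*(X) - f(X)]^2 [Y - f^*(X)]^2\}} + \sqrt{\mathbb{E}\{[f^*(X) - f(X)]^4\}}\Big)^2 \]
\[ \le \Big[2\sqrt{\mathbb{E}([f^*(X) - f(X)]^2)}\, \sqrt{\sup_{s \in S} \mathbb{E}(s^2(X)[Y - f^*(X)]^2)} + \mathbb{E}([f^*(X) - f(X)]^2)\, \sqrt{\sup_{s \in S} \mathbb{E}[s^4(X)]}\Big]^2 \]
\[ \le V [R(f) - R(f^*)], \]

with

\[ V = \Big[2\sqrt{\sup_{s \in S} \mathbb{E}(s^2(X)[Y - f^*(X)]^2)} + \sqrt{\sup_{f', f'' \in \mathcal{F}} \mathbb{E}([f'(X) - f''(X)]^2)}\, \sqrt{\sup_{s \in S} \mathbb{E}[s^4(X)]}\Big]^2, \]

where the last inequality is the usual relation between excess risk and distance
using the convexity of $\mathcal{F}$ (see above (6.33) for a proof).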



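A. Uniformly bounded conditional variance is necessary to reach d/n rate

In this section, we will see that the target (0.2) cannot be reached if we just
assume that $Y$ has a finite variance and that the functions in $\mathcal{F}$ are bounded.
For this, consider an input space $\mathcal{X}$ partitioned into two sets $\mathcal{X}_1$ and $\mathcal{X}_2$:
$\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2$ and $\mathcal{X}_1 \cap \mathcal{X}_2 = \emptyset$. Let $\varphi_1(x) = 1_{x \in \mathcal{X}_1}$ and $\varphi_2(x) = 1_{x \in \mathcal{X}_2}$. Let
$\mathcal{F} = \{\theta_1 \varphi_1 + \theta_2 \varphi_2;\ (\theta_1, \theta_2) \in [-1, 1]^2\}$.

Theorem A.1 For any estimator $\hat f$ and any training set size $n \ge 1$, we have

\[ \sup_P \big\{\mathbb{E} R(\hat f) - R(f^*)\big\} \ge \frac{1}{4\sqrt{n}}, \tag{A.1} \]

where the supremum is taken with respect to all probability distributions such that
$f^{(\mathrm{reg})} \in \mathcal{F}$ and $\mathrm{Var}\, Y \le 1$.

Proof. Let $\beta$ satisfying $0 < \beta \le 1$ be some parameter to be chosen later.
Let $P_\sigma$, $\sigma \in \{-, +\}$, be two probability distributions on $\mathcal{X} \times \mathbb{R}$ such that for any
$\sigma \in \{-, +\}$,

\[ P_\sigma(\mathcal{X}_1) = 1 - \beta, \]

\[ P_\sigma(Y = 0 \mid X = x) = 1 \quad \text{for any } x \in \mathcal{X}_1, \]

and

\[ P_\sigma\Big(Y = \frac{1}{\sqrt{\beta}} \,\Big|\, X = x\Big) = \frac{1 + \sigma\sqrt{\beta}}{2} = 1 - P_\sigma\Big(Y = -\frac{1}{\sqrt{\beta}} \,\Big|\, X = x\Big) \quad \text{for any } x \in \mathcal{X}_2. \]

One can easily check that for any $\sigma \in \{-, +\}$, $\mathrm{Var}_{P_\sigma}(Y) = 1 - \beta^2 \le 1$ and
$f^{(\mathrm{reg})} = \sigma \varphi_2 \in \mathcal{F}$. To prove Theorem A.1, it suffices to prove (A.1) when the
supremum is taken among $P \in \{P_-, P_+\}$. This is done by applying Theorem
8.2 of [3]. Indeed, the pair $(P_-, P_+)$ forms a $(1, \beta, \beta)$-hypercube in the sense of
Definition 8.2 with edge discrepancy of type I (see (8.5), (8.11) and (10.20) for
$q = 2$): $d_I = 1$. We obtain

\[ \sup_{P \in \{P_-, P_+\}} \big\{\mathbb{E} R(\hat f) - R(f^*)\big\} \ge \beta(1 - \beta\sqrt{n}), \]

which gives the desired result by taking $\beta = 1/(2\sqrt{n})$. □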
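The two elementary computations used in this proof (the moments of $Y$ under $P_\sigma$, and the value of $\beta(1 - \beta\sqrt{n})$ at $\beta = 1/(2\sqrt{n})$) can be verified with a few lines of Python. This is a minimal sketch with hypothetical function names; it encodes only the two-point conditional law of $Y$ on $\mathcal{X}_2$ described above and does not reproduce the hypercube argument of [3].

```python
import math


def two_point_law(beta, sigma):
    """Conditional law of Y on X_2: values +/- 1/sqrt(beta) with the stated weights."""
    y = 1.0 / math.sqrt(beta)
    p_plus = (1.0 + sigma * math.sqrt(beta)) / 2.0
    return [(y, p_plus), (-y, 1.0 - p_plus)]


def moments(beta, sigma):
    """Return (E[Y | X in X_2], Var(Y)) when P(X_2) = beta and Y = 0 on X_1."""
    law = two_point_law(beta, sigma)
    cond_mean = sum(v * p for v, p in law)
    second_moment = beta * sum(v * v * p for v, p in law)
    mean = beta * cond_mean
    return cond_mean, second_moment - mean ** 2


def lower_bound(n):
    """beta * (1 - beta * sqrt(n)) evaluated at beta = 1 / (2 sqrt(n))."""
    beta = 1.0 / (2.0 * math.sqrt(n))
    return beta * (1.0 - beta * math.sqrt(n))
```

For instance, `moments(0.05, 1)` returns a conditional mean of 1 (so the regression function is $\sigma\varphi_2$ with $\sigma = +1$) and a variance of $1 - 0.05^2$, while `lower_bound(100)` equals $1/(4\sqrt{100})$.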
B. Empirical risk minimization on a ball: analysis derived from the work of Birgé and Massart

We will use the following covering number upper bound [16, Lemma 1].
Lemma B.1 If $\mathcal{F}$ has a diameter $H > 0$ for $L^\infty$-norm (i.e., $\sup_{f_1, f_2 \in \mathcal{F},\, x \in \mathcal{X}} |f_1(x) - f_2(x)| = H$),
then for any $0 < \delta \le H$, there exists a set $\mathcal{F}^* \subset \mathcal{F}$, of cardinality
$|\mathcal{F}^*| \le (3H/\delta)^d$ such that for any $f \in \mathcal{F}$ there exists $g \in \mathcal{F}^*$ such that
$\|f - g\|_\infty \le \delta$.

We apply a slightly improved version of Theorem 5 in Birgé and Massart [5].
First, for homogeneity purposes, we modify Assumption M2 by replacing the condition
"$\sigma^2 \ge D/n$" by "$\sigma^2 \ge B^2 D/n$" where the constant $B$ is the one appearing
in (5.3) of [5]. This modifies Theorem 5 of [5] to the extent that "$\vee 1$" should be
replaced with "$\vee B^2$". Our second modification is to remove the assumption that
$W_i$ and $X_i$ are independent. A careful look at the proof shows that the result still
holds when (5.2) is replaced by: for any $x \in \mathcal{X}$, and $m \ge 2$

\[ \mathbb{E}_s[M^m(W_i) \mid X_i = x] \le a_m A^m, \quad \text{for all } i = 1, \dots, n. \]

We consider 14^ = F-/*(X), 7(z, /) = {y - f{x)y, A{x,u,v) = \u{x) -v(x)\, 
and M{w) = 2{\w\ + H). From (O, for all m > 2, we have E{[(2(|Vr| + 
H)Y^\X = x\< ^[4M(A + i/)]™. Now consider B' and r such that Assumption 
M2 of [5 1 holds for D = d. Inequality (5.8) for r = 1/2 of [[51 implies that 
for any v > n^{A'^ + H"^) \og{2B' + B'r^fdjn), with probability at least 1 - 



Kexp 



n " 

—nv 



i?(/(enn)) _ j^^f*^ + ^^f*^ _ ^(^(erm)) < [/(erm)(^) _ /*(X)]'} V v) /2 

for some large enough constant k depending on M. Now from Proposition 1 of 
[|5|| and Lemma iBTTl one can take either B' = Q and ry/d = or B' = 3^^/n/d 
and r = 1. By using E{ [/('=™)(X) - /*(X)]^} < i?(/('=™') - R{f*) (since J is 
convex and /* is the orthogonal projection of Y on 3^), and r(/*) — r{f^™^^) > 
(by definition of p^™^), the desired result can be derived. 

Theorem 1.5 provides a $d/n$ rate provided that the geometrical quantity $\tilde B$
is at most of order $n$. Inequality (3.2) of [5] allows to bracket $\tilde B$ in terms of
$B = \sup_{f \in \mathrm{span}\{\varphi_1, \dots, \varphi_d\}} \|f\|_\infty^2 / \mathbb{E}[f(X)^2]$, namely $B \le \tilde B \le Bd$. To understand
better how this quantity behaves and to illustrate some of the presented results, let
us give the following simple example.

Example 1. Let $A_1, \dots, A_d$ be a partition of $\mathcal{X}$, i.e., $\mathcal{X} = \sqcup_{j=1}^d A_j$. Now
consider the indicator functions $\varphi_j = 1_{A_j}$, $j = 1, \dots, d$: $\varphi_j$ is equal to 1 on
$A_j$ and zero elsewhere. Consider that $X$ and $Y$ are independent and that $Y$ is a
Gaussian random variable with mean $\theta$ and variance $\sigma^2$. In this situation: $f^*_{\mathrm{lin}} =
f^{(\mathrm{reg})} = \sum_{j=1}^d \theta \varphi_j$. According to Theorem 1.1, if we know an upper bound $H$ on
$\|f^{(\mathrm{reg})}\|_\infty = \theta$, we have that the truncated estimator $(\hat f^{(\mathrm{ols})} \wedge H) \vee -H$ satisfies
for some numerical constant $\kappa$. Let us now apply Theorem C.1. Introduce $p_j =
\mathbb{P}(X \in A_j)$ and $p_{\min} = \min_j p_j$. We have $Q = (\mathbb{E}\varphi_j(X)\varphi_k(X))_{j,k} = \mathrm{Diag}(p_j)$,
$\mathcal{K} = 1$ and $\|\theta^*\| = \theta\sqrt{d}$. We can take $A = \sigma$ and $M = 2$. From Theorem C.1,
for $\lambda = d\mathcal{L}_\varepsilon/n$, as soon as $\lambda \le p_{\min}$, the ridge regression estimator satisfies with
probability at least $1 - \varepsilon$:

^(/(ridge)) _ < ^^^d ^ ^^ ^^ 

n V np^i^ J 

for some numerical constant $\kappa$. When $d$ is large, the term $(d^2 \mathcal{L}_\varepsilon)/(n p_{\min})$ is felt,
and leads to suboptimal rates. Specifically, since $p_{\min} \le 1/d$, the r.h.s. of (B.1) is
greater than $d^4/n^2$, which is much larger than $d/n$ when $d$ is much larger than $n^{1/3}$.
If $Y$ is not Gaussian but almost surely uniformly bounded by $C < +\infty$, then the
randomized estimator proposed in Theorem 1.3 satisfies the nicer property: with
probability at least $1 - \varepsilon$,



n 



for some numerical constant $\kappa$. In this example, one can check that $B = \tilde B =
1/p_{\min}$ where $p_{\min} = \min_j \mathbb{P}(X \in A_j)$. As long as $p_{\min} \ge 1/n$, the target (0.1)
is reached from Corollary 1.5. Otherwise, without this assumption, the rate is in
$(d \log(n/d))/n$. ■
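The value $1/p_{\min}$ claimed in this example for the ratio $\sup_f \|f\|_\infty^2 / \mathbb{E}[f(X)^2]$ over the span of the indicator functions can be illustrated numerically. The sketch below is an illustration only: the cell probabilities are hypothetical, and a random search over coefficient vectors stands in for the supremum. For $f = \sum_j \theta_j 1_{A_j}$ we have $\|f\|_\infty^2 = \max_j \theta_j^2$ and $\mathbb{E}[f(X)^2] = \sum_j p_j \theta_j^2$.

```python
import random


def sup_ratio(theta, p):
    """||f||_inf^2 / E[f(X)^2] for f = sum_j theta_j * 1_{A_j}, with P(X in A_j) = p_j."""
    return max(t * t for t in theta) / sum(pj * t * t for pj, t in zip(p, theta))


p = [0.5, 0.3, 0.15, 0.05]  # hypothetical cell probabilities, summing to 1
rng = random.Random(0)      # fixed seed so the search is reproducible

# Random search over coefficient vectors: the ratio never exceeds 1 / p_min.
best_random = max(sup_ratio([rng.uniform(-1, 1) for _ in p], p) for _ in range(2000))

# The indicator of the least likely cell attains the value 1 / p_min.
j_min = p.index(min(p))
indicator = [1.0 if j == j_min else 0.0 for j in range(len(p))]
attained = sup_ratio(indicator, p)
```

Here `attained` equals $1/p_{\min} = 20$, and `best_random` stays below it, in line with the claim of the example.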



C. Ridge regression analysis from the work of Caponnetto and 

De Vito 



From im, one can derive the following risk bound for the ridge estimator. 

Theorem C.l Let gmm be the smallest eigenvalue of the d x d-product matrix 
Q = {¥jifj{X)ipk{X)).^. LetX = sup^gx Ei=i ^^'^ 11^11 be the Eu- 

clidean norm of the vector of parameters of f^^^ = J2'j=i ^jVj- Let < e < 1/2 
and /Cg = log^(£:^^). Assume that for any x E X, 

E{exp[|r - fL{X)\/A] \X = x]<M. 

For A = {%d£js)/n, if X < gmin. the ridge regression estimator satisfies with 
probability at least 1 — e: 

^(/(ndge))_^(^^*^) < ^(a^ + ^XL^A (C.l) 

for some positive constant k depending only on M. 
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Proof. One can check that f^'''^^'^ e argmin^g^^ r(/) + A ^^^^ ||/||^, where 
"K is the reproducing kernel Hilbert space associated with the kernel K : {x, x') 
Yfj=i^j{x)^k{x'). Introduce /(^) G argmin^^j^ ^(/) + AEj=i Let us use 

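Theorem 4 in [6] and the notation defined in their Section 5.2. Let $\varphi$ be the column
vector of functions $[\varphi_j]_{j=1}^d$, $\mathrm{Diag}(a_j)$ denote the diagonal $d \times d$-matrix whose $j$-th
element on the diagonal is $a_j$, and $I_d$ be the $d \times d$-identity matrix. Let $U$ and
$q_1, \dots, q_d$ be such that $UU^T = I$ and $Q = U\,\mathrm{Diag}(q_j)\,U^T$. We have $f^*_{\mathrm{lin}} = \varphi^T \theta^*$
and $f^{(\lambda)} = \varphi^T (Q + \lambda I)^{-1} Q \theta^*$, hence

\[ f^*_{\mathrm{lin}} - f^{(\lambda)} = \varphi^T U\, \mathrm{Diag}\big(\lambda/(q_j + \lambda)\big)\, U^T \theta^*. \]

After some computations, we obtain that the residual, reconstruction error and
effective dimension respectively satisfy $\mathcal{A}(\lambda) \le \frac{\lambda^2}{q_{\min}} \|\theta^*\|^2$, $\mathcal{B}(\lambda) \le \frac{\lambda^2}{q_{\min}^2} \|\theta^*\|^2$,
and $\mathcal{N}(\lambda) \le d$. The result is obtained by noticing that the leading terms in (34) of
[6] are $\mathcal{A}(\lambda)$ and the term with the effective dimension $\mathcal{N}(\lambda)$. □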
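The identity for $f^*_{\mathrm{lin}} - f^{(\lambda)}$ used in this proof is equivalent, at the level of parameters, to $\theta^* - (Q + \lambda I)^{-1} Q \theta^* = \lambda (Q + \lambda I)^{-1} \theta^*$, whose eigenvalues are the $\lambda/(q_j + \lambda)$. The following sketch checks this on a $2 \times 2$ example; the matrix, the vector and the helper names are hypothetical, and the linear systems are solved by Cramer's rule to keep the example dependency-free.

```python
def solve_2x2(a, b):
    """Solve a x = b for a 2x2 matrix a (given as a list of rows) by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def matvec(a, x):
    """Product of a 2x2 matrix with a vector of length 2."""
    return [a[0][0] * x[0] + a[0][1] * x[1], a[1][0] * x[0] + a[1][1] * x[1]]


def regularized(q, lam):
    """Q + lambda * I for a 2x2 matrix Q."""
    return [[q[0][0] + lam, q[0][1]], [q[1][0], q[1][1] + lam]]


def ridge_population(q, theta_star, lam):
    """Parameter of f^(lambda): (Q + lambda I)^{-1} Q theta*."""
    return solve_2x2(regularized(q, lam), matvec(q, theta_star))


def shrinkage_gap(q, theta_star, lam):
    """theta* minus the ridge parameter, computed as lambda (Q + lambda I)^{-1} theta*."""
    return [lam * v for v in solve_2x2(regularized(q, lam), theta_star)]


# Hypothetical Gram matrix with eigenvalues 1 and 3; (1, -1) is an eigenvector for q = 1.
q = [[2.0, 1.0], [1.0, 2.0]]
theta_star = [1.0, -1.0]
lam = 0.5
gap = shrinkage_gap(q, theta_star, lam)
```

In this example `gap` equals $\frac{\lambda}{1 + \lambda}\,\theta^*$, i.e., the component along the eigenvector with eigenvalue $q_j = 1$ is shrunk by the factor $\lambda/(q_j + \lambda)$.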

The dependence in the sample size $n$ is correct since $1/n$ is known to be minimax
optimal. The dependence on the dimension $d$ is not optimal, as it is observed
in the example given page 66. Besides, the high probability bound (C.1) holds
only for a regularization parameter $\lambda$ depending on the confidence level $\varepsilon$. So we
do not have a single estimator satisfying a PAC bound for every confidence level.
Finally the dependence on the confidence level is larger than expected. It contains
an unusual square. The example given page 66 illustrates Theorem C.1.



D. Some standard upper bounds on log-Laplace transforms 



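Lemma D.1 Let $V$ be a random variable almost surely bounded by $b \in \mathbb{R}$. Let
$g : u \mapsto [\exp(u) - 1 - u]/u^2$. Then

\[ \log\big\{ \mathbb{E}\big[ \exp[V - \mathbb{E}(V)] \big] \big\} \le \mathbb{E}(V^2)\, g(b). \]

Proof. Since $g$ is an increasing function, we have $g(V) \le g(b)$. By using the
inequality $\log(1 + u) \le u$, we obtain

\[ \log\big\{ \mathbb{E}\big[ \exp[V - \mathbb{E}(V)] \big] \big\} = -\mathbb{E}(V) + \log\big\{ \mathbb{E}[1 + V + V^2 g(V)] \big\} \]
\[ \le \mathbb{E}[V^2 g(V)] \le \mathbb{E}(V^2)\, g(b). \]

□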
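Lemma D.1 is easy to check numerically for a finitely supported random variable. The sketch below is an illustration under that assumption (the support and probabilities are hypothetical, as are the function names); it computes both sides of the inequality, taking $b$ to be the largest value of the support.

```python
import math


def g(u):
    """g(u) = (exp(u) - 1 - u) / u^2, using the expansion 1/2 + u/6 near u = 0."""
    if abs(u) < 1e-4:
        return 0.5 + u / 6.0
    return (math.exp(u) - 1.0 - u) / (u * u)


def log_laplace_centered(values, probs):
    """log E exp(V - E V) for a finitely supported random variable V."""
    mean = sum(v * p for v, p in zip(values, probs))
    return math.log(sum(p * math.exp(v - mean) for v, p in zip(values, probs)))


def lemma_d1_bound(values, probs):
    """E(V^2) g(b), with b the maximum of the support (an upper bound on V)."""
    b = max(values)
    return sum(p * v * v for v, p in zip(values, probs)) * g(b)


# Hypothetical example: V is bounded above by 1.2 but takes a large negative value.
values, probs = [-3.0, 0.5, 1.2], [0.2, 0.5, 0.3]
lhs = log_laplace_centered(values, probs)
rhs = lemma_d1_bound(values, probs)
```

The bound only requires $V$ to be bounded from above, which is why the value $-3$ in the example does not cause any difficulty.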



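Lemma D.2 Let $V$ be a real-valued random variable such that $\mathbb{E}[\exp(|V|)] \le
M$ for some $M > 0$. Then we have $|\mathbb{E}(V)| \le \log M$, and for any $-1 < \alpha < 1$,

\[ \log\big\{ \mathbb{E}\big[ \exp\{\alpha[V - \mathbb{E}(V)]\} \big] \big\} \le \frac{\alpha^2 M^2}{2\sqrt{\pi}\,(1 - |\alpha|)}. \]

Proof. First note that by Jensen's inequality, we have $|\mathbb{E}(V)| \le \log(M)$. By
using $\log(u) \le u - 1$ and Stirling's formula, for any $-1 < \alpha < 1$, we have

\[ \log\big\{ \mathbb{E}\big[ \exp\{\alpha[V - \mathbb{E}(V)]\} \big] \big\} \le \mathbb{E}\big[ \exp\{\alpha[V - \mathbb{E}(V)]\} \big] - 1 \]
\[ = \mathbb{E}\big\{ \exp\{\alpha[V - \mathbb{E}(V)]\} - 1 - \alpha[V - \mathbb{E}(V)] \big\} \]
\[ \le \mathbb{E}\big\{ \exp[|\alpha|\,|V - \mathbb{E}(V)|] - 1 - |\alpha|\,|V - \mathbb{E}(V)| \big\} \]
\[ \le \mathbb{E}\big\{ \exp[|V - \mathbb{E}(V)|] \big\}\, \sup_{u \ge 0} \big\{ [\exp(|\alpha| u) - 1 - |\alpha| u] \exp(-u) \big\} \]
\[ \le \mathbb{E}\big[ \exp(|V| + |\mathbb{E}(V)|) \big]\, \sup_{u \ge 0} \sum_{m \ge 2} \frac{|\alpha|^m u^m}{m!} \exp(-u) \]
\[ \le M^2 \sum_{m \ge 2} \frac{|\alpha|^m}{m!} \sup_{u \ge 0} u^m \exp(-u) = \alpha^2 M^2 \sum_{m \ge 2} \frac{|\alpha|^{m-2}}{m!}\, m^m \exp(-m) \]
\[ \le \alpha^2 M^2 \sum_{m \ge 2} \frac{|\alpha|^{m-2}}{\sqrt{2\pi m}} \le \frac{\alpha^2 M^2}{2\sqrt{\pi}\,(1 - |\alpha|)}. \]

□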
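The last two steps of this proof rely on the Stirling-type inequality $m^m e^{-m}/m! \le 1/\sqrt{2\pi m}$ and on a geometric series. As a quick sanity check, the following sketch (hypothetical function names, series truncated after a fixed number of terms) evaluates the quantities involved; the logarithmic form with `math.lgamma` is used only to avoid overflow for large $m$.

```python
import math


def stirling_term(m):
    """sup_{u >= 0} u^m exp(-u) / m! = m^m exp(-m) / m!, attained at u = m."""
    return math.exp(m * math.log(m) - m - math.lgamma(m + 1))


def series(alpha, terms=200):
    """sum_{m >= 2} |alpha|^(m-2) m^m exp(-m) / m!, truncated after `terms` terms."""
    a = abs(alpha)
    return sum(a ** (m - 2) * stirling_term(m) for m in range(2, terms + 2))


def closed_form(alpha):
    """1 / (2 sqrt(pi) (1 - |alpha|)), the factor multiplying alpha^2 M^2 in the lemma."""
    return 1.0 / (2.0 * math.sqrt(math.pi) * (1.0 - abs(alpha)))
```

For every $m \ge 2$ tried, `stirling_term(m)` is below $1/\sqrt{2\pi m}$, and `series(alpha)` is below `closed_form(alpha)` for $|\alpha| < 1$, which is what the last line of the proof asserts.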
E. Experimental results for the min-max truncated estimator defined in Section 3.3

Table 1: Comparison of the min-max truncated estimator $\hat f$ with the ordinary least
squares estimator $\hat f^{(\mathrm{ols})}$ for the mixture noise (see Section 3.4.1) with $\rho = 0.1$
and $p = 0.005$. In parentheses, the 95%-confidence intervals for the estimated
quantities.

| | nb of iterations | nb of iter. with $R(\hat f) \neq R(\hat f^{(\mathrm{ols})})$ | nb of iter. with $R(\hat f) < R(\hat f^{(\mathrm{ols})})$ | $\mathbb{E}R(\hat f^{(\mathrm{ols})}) - R(f^*)$ | $\mathbb{E}R(\hat f) - R(f^*)$ | $\mathbb{E}[R(\hat f^{(\mathrm{ols})}) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ | $\mathbb{E}[R(\hat f) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ |
|---|---|---|---|---|---|---|---|
| INC(n=200,d=1) | 1000 | 419 | 405 | 0.567(±0.083) | 0.178(±0.025) | 1.191(±0.178) | 0.262(±0.052) |
| INC(n=200,d=2) | 1000 | 506 | 498 | 1.055(±0.112) | 0.271(±0.030) | 1.884(±0.193) | 0.334(±0.050) |
| HCC(n=200,d=2) | 1000 | 502 | 494 | 1.045(±0.103) | 0.267(±0.024) | 1.866(±0.174) | 0.316(±0.032) |
| TS(n=200,d=2) | 1000 | 561 | 554 | 1.069(±0.089) | 0.310(±0.027) | 1.720(±0.132) | 0.367(±0.036) |
| INC(n=1000,d=2) | 1000 | 402 | 392 | 0.204(±0.015) | 0.109(±0.008) | 0.316(±0.029) | 0.081(±0.011) |
| INC(n=1000,d=10) | 1000 | 950 | 946 | 1.030(±0.041) | 0.228(±0.016) | 1.051(±0.042) | 0.207(±0.014) |
| HCC(n=1000,d=10) | 1000 | 942 | 942 | 0.980(±0.038) | 0.222(±0.015) | 1.008(±0.039) | 0.203(±0.015) |
| TS(n=1000,d=10) | 1000 | 976 | 973 | 1.009(±0.037) | 0.228(±0.017) | 1.018(±0.038) | 0.217(±0.016) |
| INC(n=2000,d=2) | 1000 | 209 | 207 | 0.104(±0.007) | 0.078(±0.005) | 0.206(±0.021) | 0.082(±0.012) |
| HCC(n=2000,d=2) | 1000 | 184 | 183 | 0.099(±0.007) | 0.076(±0.005) | 0.196(±0.023) | 0.070(±0.010) |
| TS(n=2000,d=2) | 1000 | 172 | 171 | 0.101(±0.007) | 0.080(±0.005) | 0.206(±0.020) | 0.083(±0.012) |
| INC(n=2000,d=10) | 1000 | 669 | 669 | 0.510(±0.018) | 0.206(±0.012) | 0.572(±0.023) | 0.117(±0.009) |
| HCC(n=2000,d=10) | 1000 | 669 | 669 | 0.499(±0.018) | 0.207(±0.013) | 0.561(±0.023) | 0.125(±0.011) |
| TS(n=2000,d=10) | 1000 | 754 | 753 | 0.516(±0.018) | 0.195(±0.013) | 0.558(±0.022) | 0.131(±0.011) |
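The figures reported for each setting can be cross-checked against one another. The sketch below assumes the following reading of the columns, which is an interpretation rather than something stated in the captions: the second count is the number of runs where the two estimators differ, the first pair of risk values are averages over all runs, and the second pair are averages over the runs where the estimators differ. Under this reading, the average excess risk over the remaining runs, where both estimators coincide, must be the same whether it is computed from the least squares columns or from the truncated estimator columns. The function name is hypothetical.

```python
def implied_common_risk(n_total, n_diff, overall, conditional):
    """Average excess risk over the runs where both estimators coincide.

    overall     : average excess risk over all n_total runs
    conditional : average excess risk over the n_diff runs where the estimators differ
    """
    w = n_diff / n_total
    return (overall - w * conditional) / (1.0 - w)


# Row INC(n=200,d=1) of Table 1: 1000 runs, 419 with differing estimators.
ols = implied_common_risk(1000, 419, 0.567, 1.191)
truncated = implied_common_risk(1000, 419, 0.178, 0.262)
```

Both values are close to 0.117, up to the rounding of the reported figures, which supports this reading of the columns.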
Table 2: Comparison of the min-max truncated estimator $\hat f$ with the ordinary least
squares estimator $\hat f^{(\mathrm{ols})}$ for the mixture noise (see Section 3.4.1) with $\rho = 0.4$
and $p = 0.005$. In parentheses, the 95%-confidence intervals for the estimated
quantities.

| | nb of iterations | nb of iter. with $R(\hat f) \neq R(\hat f^{(\mathrm{ols})})$ | nb of iter. with $R(\hat f) < R(\hat f^{(\mathrm{ols})})$ | $\mathbb{E}R(\hat f^{(\mathrm{ols})}) - R(f^*)$ | $\mathbb{E}R(\hat f) - R(f^*)$ | $\mathbb{E}[R(\hat f^{(\mathrm{ols})}) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ | $\mathbb{E}[R(\hat f) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ |
|---|---|---|---|---|---|---|---|
| INC(n=200,d=1) | 1000 | 234 | 211 | 0.551(±0.063) | 0.409(±0.042) | 1.211(±0.210) | 0.606(±0.110) |
| INC(n=200,d=2) | 1000 | 195 | 186 | 1.046(±0.088) | 0.788(±0.061) | 2.174(±0.293) | 0.848(±0.118) |
| HCC(n=200,d=2) | 1000 | 222 | 215 | 1.028(±0.079) | 0.748(±0.051) | 2.157(±0.243) | 0.897(±0.112) |
| TS(n=200,d=2) | 1000 | 291 | 268 | 1.053(±0.079) | 0.805(±0.058) | 1.701(±0.186) | 0.851(±0.093) |
| INC(n=1000,d=2) | 1000 | 127 | 117 | 0.201(±0.013) | 0.181(±0.012) | 0.366(±0.053) | 0.207(±0.035) |
| INC(n=1000,d=10) | 1000 | 262 | 249 | 1.023(±0.035) | 0.902(±0.030) | 1.238(±0.081) | 0.777(±0.054) |
| HCC(n=1000,d=10) | 1000 | 201 | 192 | 0.991(±0.033) | 0.902(±0.031) | 1.235(±0.088) | 0.790(±0.067) |
| TS(n=1000,d=10) | 1000 | 171 | 162 | 1.009(±0.033) | 0.951(±0.031) | 1.166(±0.098) | 0.825(±0.071) |
| INC(n=2000,d=2) | 1000 | 80 | 77 | 0.105(±0.007) | 0.099(±0.006) | 0.214(±0.042) | 0.135(±0.029) |
| HCC(n=2000,d=2) | 1000 | 44 | 42 | 0.102(±0.007) | 0.099(±0.007) | 0.187(±0.050) | 0.120(±0.034) |
| TS(n=2000,d=2) | 1000 | 47 | 47 | 0.101(±0.007) | 0.099(±0.007) | 0.147(±0.032) | 0.103(±0.026) |
| INC(n=2000,d=10) | 1000 | 116 | 113 | 0.511(±0.016) | 0.491(±0.016) | 0.611(±0.052) | 0.437(±0.042) |
| HCC(n=2000,d=10) | 1000 | 110 | 105 | 0.500(±0.016) | 0.481(±0.015) | 0.602(±0.056) | 0.430(±0.044) |
| TS(n=2000,d=10) | 1000 | 101 | 98 | 0.511(±0.016) | 0.499(±0.016) | 0.601(±0.054) | 0.486(±0.051) |
Table 3: Comparison of the min-max truncated estimator $\hat f$ with the ordinary least
squares estimator $\hat f^{(\mathrm{ols})}$ with the heavy-tailed noise (see Section 3.4.1).

| | nb of iterations | nb of iter. with $R(\hat f) \neq R(\hat f^{(\mathrm{ols})})$ | nb of iter. with $R(\hat f) < R(\hat f^{(\mathrm{ols})})$ | $\mathbb{E}R(\hat f^{(\mathrm{ols})}) - R(f^*)$ | $\mathbb{E}R(\hat f) - R(f^*)$ | $\mathbb{E}[R(\hat f^{(\mathrm{ols})}) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ | $\mathbb{E}[R(\hat f) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ |
|---|---|---|---|---|---|---|---|
| INC(n=200,d=1) | 1000 | 163 | 145 | 7.72(±3.46) | 3.92(±0.409) | 30.52(±20.8) | 7.20(±1.61) |
| INC(n=200,d=2) | 1000 | 104 | 98 | 22.69(±23.14) | 19.18(±23.09) | 45.36(±14.1) | 11.63(±2.19) |
| HCC(n=200,d=2) | 1000 | 120 | 117 | 18.16(±12.68) | 8.07(±0.718) | 99.39(±105) | 15.34(±4.41) |
| TS(n=200,d=2) | 1000 | 110 | 105 | 43.89(±63.79) | 39.71(±63.76) | 48.55(±18.4) | 10.59(±2.01) |
| INC(n=1000,d=2) | 1000 | 104 | 100 | 3.98(±2.25) | 1.78(±0.128) | 23.18(±21.3) | 2.03(±0.56) |
| INC(n=1000,d=10) | 1000 | 253 | 242 | 16.36(±5.10) | 7.90(±0.278) | 41.25(±19.8) | 7.81(±0.69) |
| HCC(n=1000,d=10) | 1000 | 220 | 211 | 13.57(±1.93) | 7.88(±0.255) | 33.13(±8.2) | 7.28(±0.59) |
| TS(n=1000,d=10) | 1000 | 214 | 211 | 18.67(±11.62) | 13.79(±11.52) | 30.34(±7.2) | 7.53(±0.58) |
| INC(n=2000,d=2) | 1000 | 113 | 103 | 1.56(±0.41) | 0.89(±0.059) | 6.74(±3.4) | 0.86(±0.18) |
| HCC(n=2000,d=2) | 1000 | 105 | 97 | 1.66(±0.43) | 0.95(±0.062) | 7.87(±3.8) | 1.13(±0.23) |
| TS(n=2000,d=2) | 1000 | 101 | 95 | 1.59(±0.64) | 0.88(±0.058) | 8.03(±6.2) | 1.04(±0.22) |
| INC(n=2000,d=10) | 1000 | 259 | 255 | 8.77(±4.02) | 4.23(±0.154) | 21.54(±15.4) | 4.03(±0.39) |
| HCC(n=2000,d=10) | 1000 | 250 | 242 | 6.98(±1.17) | 4.13(±0.127) | 15.35(±4.5) | 3.94(±0.25) |
| TS(n=2000,d=10) | 1000 | 238 | 233 | 8.49(±3.61) | 5.95(±3.486) | 14.82(±3.8) | 4.17(±0.30) |
Table 4: Comparison of the min-max truncated estimator $\hat f$ with the ordinary least
squares estimator $\hat f^{(\mathrm{ols})}$ with an asymmetric variant of the heavy-tailed noise.

| | nb of iterations | nb of iter. with $R(\hat f) \neq R(\hat f^{(\mathrm{ols})})$ | nb of iter. with $R(\hat f) < R(\hat f^{(\mathrm{ols})})$ | $\mathbb{E}R(\hat f^{(\mathrm{ols})}) - R(f^*)$ | $\mathbb{E}R(\hat f) - R(f^*)$ | $\mathbb{E}[R(\hat f^{(\mathrm{ols})}) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ | $\mathbb{E}[R(\hat f) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ |
|---|---|---|---|---|---|---|---|
| INC(n=200,d=1) | 1000 | 87 | 77 | 5.49(±3.07) | 3.00(±0.330) | 35.44(±34.7) | 6.85(±2.48) |
| INC(n=200,d=2) | 1000 | 70 | 66 | 19.25(±23.23) | 17.4(±23.2) | 37.95(±13.1) | 11.05(±2.87) |
| HCC(n=200,d=2) | 1000 | 67 | 66 | 7.19(±0.88) | 5.81(±0.397) | 31.52(±10.5) | 10.87(±2.64) |
| TS(n=200,d=2) | 1000 | 76 | 68 | 39.80(±64.09) | 37.9(±64.1) | 34.28(±14.8) | 9.21(±2.05) |
| INC(n=1000,d=2) | 1000 | 101 | 92 | 2.81(±2.21) | 1.31(±0.106) | 16.76(±21.8) | 1.88(±0.69) |
| INC(n=1000,d=10) | 1000 | 211 | 195 | 10.71(±4.53) | 5.86(±0.222) | 29.00(±21.3) | 6.03(±0.71) |
| HCC(n=1000,d=10) | 1000 | 197 | 185 | 8.67(±1.16) | 5.81(±0.177) | 20.31(±5.59) | 5.79(±0.43) |
| TS(n=1000,d=10) | 1000 | 258 | 233 | 13.62(±11.27) | 11.3(±11.2) | 14.68(±2.45) | 5.60(±0.36) |
| INC(n=2000,d=2) | 1000 | 106 | 92 | 1.04(±0.37) | 0.64(±0.042) | 4.54(±3.45) | 0.79(±0.16) |
| HCC(n=2000,d=2) | 1000 | 99 | 90 | 0.90(±0.11) | 0.66(±0.042) | 3.23(±0.93) | 0.82(±0.16) |
| TS(n=2000,d=2) | 1000 | 84 | 81 | 1.11(±0.66) | 0.60(±0.042) | 6.80(±7.79) | 0.69(±0.17) |
| INC(n=2000,d=10) | 1000 | 238 | 222 | 6.32(±4.18) | 3.07(±0.147) | 16.84(±17.5) | 3.18(±0.51) |
| HCC(n=2000,d=10) | 1000 | 221 | 203 | 4.49(±0.98) | 2.98(±0.091) | 9.76(±4.39) | 2.93(±0.22) |
| TS(n=2000,d=10) | 1000 | 412 | 350 | 5.93(±3.51) | 4.59(±3.44) | 6.07(±1.76) | 2.84(±0.16) |
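Table 5: Comparison of the min-max truncated estimator $\hat f$ with the ordinary least
squares estimator $\hat f^{(\mathrm{ols})}$ for standard Gaussian noise.

| | nb of iter. | nb of iter. with $R(\hat f) \neq R(\hat f^{(\mathrm{ols})})$ | nb of iter. with $R(\hat f) < R(\hat f^{(\mathrm{ols})})$ | $\mathbb{E}R(\hat f^{(\mathrm{ols})}) - R(f^*)$ | $\mathbb{E}R(\hat f) - R(f^*)$ | $\mathbb{E}[R(\hat f^{(\mathrm{ols})}) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ | $\mathbb{E}[R(\hat f) \mid \hat f \neq \hat f^{(\mathrm{ols})}] - R(f^*)$ |
|---|---|---|---|---|---|---|---|
| INC(n=200,d=1) | 1000 | 20 | 8 | 0.541(±0.048) | 0.541(±0.048) | 0.401(±0.168) | 0.397(±0.167) |
| INC(n=200,d=2) | 1000 | 1 | 0 | 1.051(±0.067) | 1.051(±0.067) | 2.566 | 2.757 |
| HCC(n=200,d=2) | 1000 | 1 | 0 | 1.051(±0.067) | 1.051(±0.067) | 2.566 | 2.757 |
| TS(n=200,d=2) | 1000 | 0 | 0 | 1.068(±0.067) | 1.068(±0.067) | - | - |
| INC(n=1000,d=2) | 1000 | 0 | 0 | 0.203(±0.013) | 0.203(±0.013) | - | - |
| INC(n=1000,d=10) | 1000 | 0 | 0 | 1.023(±0.029) | 1.023(±0.029) | - | - |
| HCC(n=1000,d=10) | 1000 | 0 | 0 | 1.023(±0.029) | 1.023(±0.029) | - | - |
| TS(n=1000,d=10) | 1000 | 0 | 0 | 0.997(±0.028) | 0.997(±0.028) | - | - |
| INC(n=2000,d=2) | 1000 | 0 | 0 | 0.112(±0.007) | 0.112(±0.007) | - | - |
| HCC(n=2000,d=2) | 1000 | 0 | 0 | 0.112(±0.007) | 0.112(±0.007) | - | - |
| TS(n=2000,d=2) | 1000 | 0 | 0 | 0.098(±0.006) | 0.098(±0.006) | - | - |
| INC(n=2000,d=10) | 1000 | 0 | 0 | 0.517(±0.015) | 0.517(±0.015) | - | - |
| HCC(n=2000,d=10) | 1000 | 0 | 0 | 0.517(±0.015) | 0.517(±0.015) | - | - |
| TS(n=2000,d=10) | 1000 | 0 | 0 | 0.501(±0.015) | 0.501(±0.015) | - | - |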
Figure 1: Surrounding points are the points of the training set generated several
times from TS(1000, 10) (with the mixture noise with $p = 0.005$ and $\rho = 0.4$)
that are not taken into account in the min-max truncated estimator (to the extent
that the estimator would not change by removing simultaneously all these points).
The min-max truncated estimator $x \mapsto \hat f(x)$ appears in dash-dot line, while $x \mapsto
\mathbb{E}(Y \mid X = x)$ is in solid line. In these six simulations, it outperforms the ordinary
least squares estimator.
Figure 2: Surrounding points are the points of the training set generated several
times from TS(200, 2) (with the heavy-tailed noise) that are not taken into account
in the min-max truncated estimator (to the extent that the estimator would not
change by removing these points). The min-max truncated estimator $x \mapsto \hat f(x)$
appears in dash-dot line, while $x \mapsto \mathbb{E}(Y \mid X = x)$ is in solid line. In these six
simulations, it outperforms the ordinary least squares estimator. Note that in the
last figure, it does not consider 64 points among the 200 training points.